Abstract

A basic problem in spectral clustering is the following. If a solution obtained from the
spectral relaxation is close to an integral solution, is it possible to find this integral solution even
though the two might be expressed in completely different bases? In this paper, we propose a new spectral
clustering algorithm. It can recover a $k$-partition such that the subspace corresponding to the
span of its indicator vectors is $O(\sqrt{\mathrm{OPT}})$-close to the original subspace in spectral norm, with
OPT being the minimum possible ($\mathrm{OPT} \le 1$ always). Moreover, our algorithm does not impose
any restriction on the cluster sizes. Previously, no algorithm was known which could find a
$k$-partition closer than $o(k \cdot \mathrm{OPT})$.

We present two applications of our algorithm. The first one finds a disjoint union of bounded
degree expanders which approximates a given graph in spectral norm. The second one
approximates the sparsest $k$-partition in a graph where each cluster has expansion at most
$\phi_k$, provided $\phi_k \le O(\lambda_{k+1})$, where $\lambda_{k+1}$ is the $(k+1)^{st}$ eigenvalue of the Laplacian matrix. This
significantly improves upon the previous algorithms, which required $\phi_k \le O(\lambda_{k+1}/k)$.


1 Introduction 

In this paper, we study the following problem. If the solution of spectral relaxation for some 
$k$-way partitioning problem is close to an integral solution, can we still find this integral solution?
The main difficulty is due to the rotational invariance of the spectral relaxation. The basis of an 
integral solution might be completely different than the basis of solutions for the spectral relaxation. 
Arguably, this is an important problem in spectral clustering, which is a widely used approach for 
many data clustering and graph partitioning problems arising in practice. 

In spectral clustering, one uses the top (or bottom) $k$ eigenvectors of some matrix derived from
the input (usually the Laplacian or adjacency matrix of some graph derived from the distances
or nearest neighbors) to find a $k$-partition. If the clusters are separated in a nice way, then these
$k$ eigenvectors will be close to a $k$-partition up to an arbitrary rotation. Hence a crucial part of spectral
clustering methods is how to "round" these $k$ eigenvectors to a close-by $k$-partition.

Formally, we study the problem of approximating a $k$-dimensional linear subspace of $\mathbb{R}^n$ by
another subspace which is $k$-piecewise constant: Every vector of this subspace has its coordinates
comprised of at most $k$ distinct values. Or equivalently, given a $k$-by-$n$ orthonormal matrix $Y$ of
the form $Y = [y_1, \ldots, y_n]$ (think of $Y$ as an embedding of $n$ points in $\mathbb{R}^k$), our problem is to find a
$k$-partition $\Gamma = \{T_1, \ldots, T_k\}$ so as to minimize the total variance under any direction:

$$\min_{\Gamma} \max_{z \in \mathbb{R}^k : \|z\|_2 = 1} \sum_{S \in \Gamma} \sum_{u \in S} \langle z, y_u - c_S \rangle^2 .$$


Here $c_S$ is the mean of points in the cluster $S$. If we use $C \in \mathbb{R}^{n \times k}$ to denote the matrix of cluster
means with each row of $C$ being one of the cluster means, then our objective can be stated more
concisely as $\min_C \|Y - C\|_2$ with $\|\cdot\|_2$ being the spectral norm. Geometrically speaking, this
corresponds to finding a $k$-piecewise constant subspace that makes the minimum angle with $Y$.
This is the problem of clustering with spectral norm [KK10]. In 2 dimensions, where $k = 2$, the optimal
solution corresponds to one of the threshold cuts. From this perspective, our problem can be seen
as a generalization of thresholding to higher dimensions.
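
As a rough illustration of this objective (a minimal sketch, not part of the algorithm): for $k = 2$ the inner maximum over unit directions $z$ is the largest eigenvalue of the 2-by-2 within-cluster scatter matrix, which has a closed form. The function name `spectral_objective_2d` and the toy points are assumptions made for this example; in particular, the points do not form an orthonormal $Y$.

```python
import math

def spectral_objective_2d(points, clusters):
    """max over unit z of sum_S sum_{u in S} <z, y_u - c_S>^2, for points in R^2.

    points   : list of (x, y) pairs, one per node
    clusters : list of lists of node indices
    """
    a = b = d = 0.0  # scatter matrix [[a, b], [b, d]]
    for cluster in clusters:
        cx = sum(points[u][0] for u in cluster) / len(cluster)
        cy = sum(points[u][1] for u in cluster) / len(cluster)
        for u in cluster:
            dx, dy = points[u][0] - cx, points[u][1] - cy
            a += dx * dx
            b += dx * dy
            d += dy * dy
    # Top eigenvalue of a symmetric 2x2 matrix.
    return (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b * b)

points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
good = spectral_objective_2d(points, [[0, 1], [2, 3]])  # clusters split along x
bad = spectral_objective_2d(points, [[0, 2], [1, 3]])   # clusters split along y
print(good, bad)
```

The partition that separates the two well-separated groups leaves much less variance in its worst direction than the partition that mixes them.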

Our main contribution is a new spectral clustering algorithm that can recover a $k$-partition
whose center matrix $C$ satisfies $\|Y - C\|_2 \le O(\sqrt{\mathrm{OPT}})$, where OPT is the minimum possible
(observe that $\mathrm{OPT} \le 1$). Furthermore, the recovered $k$-partition will be $O(\sqrt{\mathrm{OPT}})$-close in Jaccard
index to the optimum partition: Each cluster we found will be close to a unique cluster among
the optimum $k$-partition. Previously, no algorithm was known to find a $k$-partition closer than
$o(k \cdot \mathrm{OPT})$.

We also study two closely related problems. In the first one, the goal is to approximate a matrix
in spectral norm by a block diagonal matrix, with every block being the normalized adjacency
matrix of a clique. In our second application, we turn to the problem of $k$-EXPANSION. Given an
undirected, weighted graph $G = (V, C)$, find a $k$-partition $\Gamma = \{S_1, \ldots, S_k\}$ of the nodes so as to
minimize the maximum expansion:

$$\Phi_k(G) \overset{\mathrm{def}}{=} \min_{\Gamma} \max_{S \in \Gamma} \frac{C(S, \overline{S})}{\min(|S|, |\overline{S}|)} . \qquad (1)$$


Here $C(S, \overline{S})$ denotes the total weight of edges crossing $S$. Our second application is for approximating
the optimum $k$-partition of $k$-EXPANSION on graphs whose spectrum grows faster than $\phi_k$
(we will make this precise later).

The choice of spectral norm to measure the closeness of associated subspaces is quite natural
from the perspective of our second application. Given any subspace, we show how to construct
graphs in polynomial time, such that approximating $k$-EXPANSION on such graphs implies a
solution for the spectral clustering problem. From this perspective, we can see that the subspace
rounding problem is a prerequisite toward obtaining an $o(k)$-factor approximation algorithm for
the problem of $k$-EXPANSION, where the best known is $O(k^4)$ due to [LGT14].


1.1 Related Work 

Spectral methods have been successfully used for clustering tasks [Bol13] arising in many different
areas such as VLSI [AKY99], machine learning, data analysis [NJW01] and computer vision [SM00,
YS03]. They are usually obtained by formulating the clustering task as a combinatorial optimization
problem (such as sparsest/normalized cuts [SM00]), then solving the corresponding basic SDP
relaxation, whose solution is often given by k extremal eigenvectors of an associated matrix. 
One of the first spectral clustering algorithms with worst case guarantees was given in [KW04]
for the graph partitioning problem assuming certain conditions on the internal versus external
conductance. The problem of finding a $k$-partition so as to minimize the spectral norm was
first introduced by [KK10] in the context of learning mixtures of Gaussians. The best known
approximation factor is $O(k)$ due to [AS12].

A problem closely related to spectral clustering is $k$-EXPANSION, as defined in (1). When
all cluster sizes are constrained to be nearly equal, this problem admits an $O(\sqrt{\log n \log k})$-factor
approximation [BFK+11]. On the other hand, if a bi-criteria approximation is sought, then one can
find $(1 - \Omega(1))k$ clusters each of which has expansion at most $O(\sqrt{\log n \log k})$ times the optimum
[LM14].

If we look at the basic SDP relaxation of $k$-EXPANSION, then the optimal fractional solution is
given by the $k$ smallest eigenvectors of the corresponding graph Laplacian matrix. In fact, this is
the main motivation behind the usage of $k$ eigenvectors for clustering tasks in practice. A natural
question is whether one can "round" these eigenvectors to a $k$-partition (the so called Cheeger
inequalities). When $k = 2$, it was shown in [AM85] that simple thresholding yields a 2-partition
with $O(\sqrt{\phi_2})$ expansion, where $\phi_k$ is the optimal value for $k$-EXPANSION. Later a better bound
was given in [KLL+13], assuming there is some gap between eigenvalues. When $k > 2$, bi-criteria
versions of Cheeger's inequality are known [ABS10, LRTV12, LGT14]. Here the guarantees on the
expansion are of the form $\widetilde{O}(\sqrt{\phi_k})$, where $\widetilde{O}$ hides the dependencies on logarithmic factors; but the
algorithms can only find $(1 - \Omega(1))k$ parts.

The problem becomes significantly harder when exactly $k$ clusters are desired. In this case,
it was shown in [LGT14] that a method similar to the one proposed in [NJW01] will yield a
$k$-partition with maximum expansion $O(k^4 \sqrt{\phi_k})$. This is the best known approximation algorithm
for the $k$-EXPANSION problem and, as of yet, there is no algorithm known which achieves a
poly-logarithmic approximation.

Perhaps the simplest case of $k$-EXPANSION is when there is a gap between the $(k+1)^{st}$ smallest
eigenvalue of the Laplacian matrix, $\lambda_{k+1}$, and $\phi_k$ of the form $\phi_k \le \varepsilon \lambda_{k+1}$. One might think of this as a
stability criterion: It implies that all $k$-partitions with maximum expansion $\le O(\phi_k)$ are $O(\varepsilon)$-close
to each other. To put it another way, approximating the optimum $k$-partition is at least as easy as
finding a $k$-partition with minimum possible expansion among all its clusters. For the case of $k = 2$,
it is trivial to show that thresholding the second smallest eigenvector of the Laplacian yields a partition
$\varepsilon$-close to the optimal one. On the other hand, when $k > 2$, the best prior result is due to [AS12],
which can find a $k$-partition that is $O(k\varepsilon)$-close to the optimal one. In other words, when $\varepsilon > 1/k$
there is no algorithm known to find a non-trivial approximation of the optimum $k$-partition.

1.2 Organization 

We first introduce some useful notation and background in Section 2. After this, we state our
main contributions in Section 3. Then we propose a new spectral clustering algorithm in Section 4.
In Section 5, we will prove that our algorithm always finds a $k$-partition that is $\sqrt{\varepsilon}$-close to any given
subspace, where $\varepsilon$ is the optimum. In Section 6, we discuss some applications of our algorithm.
Our main applications will be:

• (Section 6.1) Approximating a graph using a disjoint union of expanders and,

• (Section 6.2) $k$-EXPANSION when $\phi_k \le O(\lambda_{k+1})$.

Finally, in Section 7, we present a simple reduction from $k$-EXPANSION to our problem: This means
any algorithm for $k$-EXPANSION has to solve our subspace rounding problem as well.


2 Notation and Background 


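Let $[m] = \{1, 2, \ldots, m\}$. We will associate $V = [n]$ with the set of nodes. For any vector $q \in \mathbb{R}^m$,
$\overline{q} = \frac{q}{\|q\|_2}$ and $q^{\otimes 2} \overset{\mathrm{def}}{=} q q^T$. Given a subset $S \subseteq V$, we use $e_S \in \mathbb{R}^n$ to denote the indicator vector for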


$S$: $e_S(i) = 1$ if $i \in S$, and $e_S(i) = 0$ otherwise.

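Matrices. We use $\mathbb{R}^{r \times c}$ to denote the set of $r$-by-$c$ real matrices. Likewise, we use $\mathbb{S}^c$ and $\mathbb{S}^c_+ \subseteq \mathbb{S}^c$
to denote the set of $c$-by-$c$ symmetric and positive semidefinite matrices, respectively. Finally let
$\mathcal{S}_k(\mathbb{R}^n)$ be the set of all $n$-by-$k$ orthonormal matrices (Stiefel manifold) for $k \le n$:

$$\mathcal{S}_k(\mathbb{R}^n) \overset{\mathrm{def}}{=} \{ A \in \mathbb{R}^{n \times k} \mid A^T A = I_k \} .$$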


Given an $r$-by-$c$ matrix $A \in \mathbb{R}^{r \times c}$, we use $\sigma_i(A)$, $i \in \{1, 2, \ldots, \min(r, c)\}$, to refer to the $i^{th}$ largest
singular value of $A$. We define $\sigma_{\min}(A)$ as the minimum singular value of $A$, and $\|A\|_2$ as the 2-norm
of $A$, which is $\|A\|_2 = \sigma_1(A)$. Likewise $\|A\|_F$ denotes the Frobenius norm of $A$, $\|A\|_F = \sqrt{\mathrm{Tr}(A^T A)} =
\sqrt{\sum_i \sigma_i^2(A)}$. Given subsets $R \subseteq [r]$, $C \subseteq [c]$, we will use $A_{R,C}$ to refer to the minor corresponding to
rows $R$ and columns $C$.

Finally, we will use $A^{\Pi}, A^{\perp} \in \mathbb{S}^r_+$ to denote the $r$-by-$r$ projection matrices onto the column space
and co-kernel of $A$, respectively. Observe that for any $A \in \mathcal{S}_k(\mathbb{R}^n)$, $A^{\Pi} = AA^T$ and $A^{\perp} = I_n - AA^T$.

One way of measuring the closeness of two subspaces is to look at how much (in degrees) we 
need to rotate a vector in one subspace to the closest vector in the other subspace. It is well known 
that this quantity is related to the spectral norm. For completeness, we provide a formal version of 
this statement along with its proof: 


Proposition 2.1 ([SS90]). Given two linear $k$-dimensional subspaces of $\mathbb{R}^n$ with orthonormal bases $A, B \in
\mathcal{S}_k(\mathbb{R}^n)$ respectively, the cosine of the largest angle between these two subspaces is given by the following:

$$\cos(\angle AB) \overset{\mathrm{def}}{=} \min_{x \in \mathrm{span}(A)} \max_{y \in \mathrm{span}(B)} \frac{|\langle x, y \rangle|}{\|x\|_2 \|y\|_2} .$$

We have $\sin(\angle AB) = \|A^{\perp} B\|_2 = \|B^{\perp} A\|_2$.


Proof. From the definition of $\angle AB$, it is easy to see how it measures the maximum degrees necessary
to rotate a point in $A$ to any point in $B$ and vice versa. We will now prove the second statement.
Any point $x$ in $\mathrm{span}(A)$ can be written as $Ap$ for some $p \in \mathbb{R}^k$. Moreover $A$ is orthonormal, thus
$\|x\| = \|Ap\| = \|p\|$. This allows us to rewrite $\cos(\angle AB)$ as follows:

$$\min_{x \in \mathrm{span}(A)} \max_{y \in \mathrm{span}(B)} \frac{|\langle x, y \rangle|}{\|x\|_2 \|y\|_2} = \min_{p} \max_{y \in \mathrm{span}(B)} \frac{|\langle Ap, y \rangle|}{\|p\|_2 \|y\|_2} .$$

For any $p$, the best $y$ is given by $B^{\Pi} A p$. Moreover $\|B^{\Pi} A p\|_2^2 + \|B^{\perp} A p\|_2^2 = \|p\|_2^2$, thus:

$$\cos(\angle AB) = \min_{p} \frac{\|B^{\Pi} A p\|_2}{\|p\|_2} = \sqrt{1 - \max_{p} \frac{\|B^{\perp} A p\|_2^2}{\|p\|_2^2}} .$$

Consequently, $\sin(\angle AB) = \max_p \frac{\|B^{\perp} A p\|_2}{\|p\|_2} = \|B^{\perp} A\|_2$. □

Definition 2.2. Let $\mathrm{Set}_V(k)$ be the family of sets of $k$ non-empty subsets of $V$. We will use $\mathrm{Disj}_V(k) \subseteq
\mathrm{Set}_V(k)$ to denote the set of $k$ disjoint subsets of $V$: $\Gamma \in \mathrm{Disj}_V(k)$ if and only if $\Gamma \in \mathrm{Set}_V(k)$ and $S \cap T = \emptyset$
for all $S \ne T \in \Gamma$.

In order to compare subspaces with $k$-partitions, we need to identify a canonical representation
of the subspaces associated with $k$-partitions. The most natural representation is to use each basis
vector as the normalized indicator of one of the clusters.

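Notation 2.3 (Basis Matrices of $k$-Partitions). Given $k$-subsets $\Gamma = \{T_1, \ldots, T_k\}$ of $V$, let $\Gamma \in \mathbb{R}^{n \times k}$
be the corresponding normalized incidence matrix $\Gamma \overset{\mathrm{def}}{=} [\overline{e}_{T_1} \cdots \overline{e}_{T_k}]$. We will use $\Gamma^{\Pi} \in \mathbb{S}^n_+$ and
$\Gamma^{\perp} \in \mathbb{S}^n_+$ to denote the associated projection matrices so that $\Gamma^{\Pi} \Gamma = \Gamma$ and $\Gamma^{\perp} \Gamma = 0$.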

Multiplication with either of the projection matrices $\Gamma^{\Pi}$ and $\Gamma^{\perp}$ has a natural correspondence
with means and the differences to means:


Proposition 2.4. If $\Gamma \in \mathrm{Disj}_V$, then $\Gamma$ is an orthonormal matrix, $\Gamma \in \mathcal{S}_k(\mathbb{R}^n)$, and $\Gamma^{\perp}$ is a Laplacian
matrix. For any $Y \in \mathbb{R}^{k \times n}$, the $i^{th}$ column of:

• $Y\Gamma^{\Pi}$ is the mean of points in the same cluster with $i$ provided $i$ is in any cluster of $\Gamma$, and 0 otherwise.

• $Y\Gamma^{\perp}$ is the difference between $y_i$ and its associated center as defined above.

For example, $\|Y\Gamma^{\perp}\|_F^2$ measures the sum of squared distances of each point to the center of its cluster or
origin if it is not in any cluster.
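
The correspondence in Proposition 2.4 can be checked numerically on a toy instance. The following is a minimal sketch using plain Python lists; the helper names (`incidence`, `matmul`, `transpose`) and the data are assumptions made for illustration, and node 5 is deliberately left uncovered.

```python
import math

def incidence(clusters, n):
    """n-by-k normalized incidence matrix: column j is e_T / sqrt(|T|)."""
    return [[1 / math.sqrt(len(T)) if u in T else 0.0 for T in clusters]
            for u in range(n)]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

clusters = [{0, 1}, {2, 3, 4}]            # node 5 belongs to no cluster
Y = [[1.0, 3.0, 0.0, 0.0, 6.0, 7.0],      # 2-by-6: column u is the point y_u
     [0.0, 0.0, 2.0, 4.0, 0.0, 7.0]]

G = incidence(clusters, 6)
gram = matmul(transpose(G), G)            # k-by-k, identity for disjoint clusters
G_pi = matmul(G, transpose(G))            # projection onto span of the indicators
means = matmul(Y, G_pi)                   # column u: mean of u's cluster, or 0
print(means)
```

Each column of `means` repeats the center of the cluster containing that node, and the column for the uncovered node is zero.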

We will measure the distance between sets in a way similar to cosine distance. 

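Notation 2.5. Given $p, q \in \mathbb{R}^n$, we define $\Delta(p,q)$ as $\Delta(p,q) \overset{\mathrm{def}}{=} 1 - \langle \overline{p}, \overline{q} \rangle^2$ (recall $\overline{p} = p/\|p\|_2$). Note that $\Delta(p,q) =
\frac{1}{2} \|\overline{p}^{\otimes 2} - \overline{q}^{\otimes 2}\|_F^2$. For convenience, we will use $\Delta(S, q)$ as $\Delta(e_S, q)$. In particular, $\Delta(A, B) = 1 - \frac{|A \cap B|^2}{|A||B|}$.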

Our measure of set similarity is closely related to the Jaccard index. 

Proposition 2.6. For any pair of subsets $A, B \subseteq V$:

$$\frac{1}{4} \frac{|A \triangle B|}{|A \cup B|} \le \Delta(A, B) \le \frac{|A \triangle B|}{|A \cup B|} .$$


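Proof. Since $|A \cup B|^2 \ge |A||B|$, we immediately see that $1 - \Delta(A, B) \ge \frac{|A \cap B|^2}{|A \cup B|^2}$. For the other direction,
suppose $\Delta(A, B) \le \varepsilon$ and $|A| \ge |B|$. Then $(1 - \varepsilon)\sqrt{|A||B|} \le |A \cap B|$, which together with $|A \cap B| \le |B|$ implies

$$\sqrt{|B|} \ge \frac{|A \cap B|}{\sqrt{|B|}} \ge (1 - \varepsilon) \sqrt{|A|} .$$

Therefore $|A \cap B| \ge (1 - \varepsilon)^2 |A|$ and

$$|A \triangle B| = |A| + |B| - 2|A \cap B| \le |B| - (1 - 4\varepsilon + 2\varepsilon^2)|A| \le 4\varepsilon |B| .$$

In particular,

$$\frac{|A \triangle B|}{|A \cup B|} \le 4\Delta(A, B) .$$

□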
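
To make the set measure concrete, here is a minimal sketch that evaluates $\Delta(A, B)$ and the Jaccard-type ratio $|A \triangle B| / |A \cup B|$ on two small sets and checks the bound $\frac{1}{4}|A \triangle B|/|A \cup B| \le \Delta(A,B)$ derived in the proof. It assumes the reading $\Delta(A,B) = 1 - |A \cap B|^2 / (|A||B|)$ of Notation 2.5; the function names are hypothetical.

```python
def delta(A, B):
    """Delta(A, B) = 1 - |A ∩ B|^2 / (|A| |B|) for non-empty sets A, B."""
    return 1 - len(A & B) ** 2 / (len(A) * len(B))

def jaccard_distance(A, B):
    """|A △ B| / |A ∪ B|."""
    return len(A ^ B) / len(A | B)

A = set(range(10))       # {0, ..., 9}
B = set(range(1, 11))    # {1, ..., 10}: one element dropped, one added
d = delta(A, B)
j = jaccard_distance(A, B)
print(d, j, j / 4 <= d)

# Under this definition, the closest set to a 3-element set S that differs
# from S is obtained by adding one element, at distance 1/4.
small = delta({0, 1, 2}, {0, 1, 2, 3})
```

This also illustrates why small clusters must be matched exactly: changing a single element of a small set already moves it by a large amount in $\Delta$.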
We will now generalize our set similarity measure to $k$-partitions.

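Notation 2.7. Given $\Gamma, \widehat{\Gamma} \in \mathrm{Set}_V(k)$, we define $\Delta(\Gamma, \widehat{\Gamma})$ as:

$$\Delta(\Gamma, \widehat{\Gamma}) \overset{\mathrm{def}}{=} \min_{\pi : \Gamma \leftrightarrow \widehat{\Gamma}} \max_{S \in \Gamma} \Delta(S, \pi(S)) .$$

We say $A$ (resp. $\Gamma$) is $\varepsilon$-close to $B$ (resp. $\widehat{\Gamma}$) whenever $\Delta(A, B) \le \varepsilon$ (resp. $\Delta(\Gamma, \widehat{\Gamma}) \le \varepsilon$).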
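
A direct, brute-force reading of Notation 2.7 is sketched below: try every bijection between the two families and keep the one whose worst pair is closest. This is only meant to illustrate the definition (it enumerates $k!$ bijections), and it again assumes $\Delta(A,B) = 1 - |A \cap B|^2/(|A||B|)$; `partition_delta` is a hypothetical name.

```python
from itertools import permutations

def delta(A, B):
    """Delta(A, B) = 1 - |A ∩ B|^2 / (|A| |B|) for non-empty sets A, B."""
    return 1 - len(A & B) ** 2 / (len(A) * len(B))

def partition_delta(P, Q):
    """min over bijections pi: P -> Q of max over S in P of Delta(S, pi(S))."""
    return min(max(delta(S, T) for S, T in zip(P, perm))
               for perm in permutations(Q))

P = [{0, 1, 2, 3}, {4, 5}]
Q = [{4, 5}, {0, 1, 2}]          # same clusters in a different order, one node missing
dist = partition_delta(P, Q)
print(dist)
```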

Observe that our notion of proximity is a very strong bound. For example if $\widehat{\Gamma}$ is $\varepsilon$-close to $\Gamma^*$,
then any subset $S \in \Gamma^*$ of size $|S| < \frac{1}{2\varepsilon}$ has to be preserved exactly in $\widehat{\Gamma}$.

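The next theorem says that the similarity measure we use for $k$-partitions in Notation 2.7 is
tightly related to the spectral norm distance between the corresponding bases.

Theorem 2.8. Given $\Gamma, \widehat{\Gamma} \in \mathrm{Disj}_V(k)$: $\Delta(\Gamma, \widehat{\Gamma}) \le \|\Gamma^{\perp} \widehat{\Gamma}\|_2^2 \le 2\Delta(\Gamma, \widehat{\Gamma})$. Moreover, after appropriately
ordering the columns of $\widehat{\Gamma}$, $\|\Gamma - \widehat{\Gamma}\|_2^2 \le 4\Delta(\Gamma, \widehat{\Gamma})$.

The proof of Theorem 2.8 is given in Section 8.1.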

Proposition 2.9. Given $A, B \in \mathcal{S}_k(\mathbb{R}^n)$, $\sigma_{\min}(A^T B) = \sqrt{1 - \|A^{\perp} B\|_2^2}$.

Proof. $B^T A^{\perp} B = B^T B - B^T A A^T B = I_k - B^T A A^T B$. Since $\|A^T B\|_2 \le 1$, $\|A^{\perp} B\|_2^2 = 1 -
\sigma_{\min}(A^T B)^2$. □


Consider two subspaces with bases $A$ and $B$. If the angle between these two subspaces is small,
then one might intuitively expect that $AA^T$ and $BB^T$ are also very close. In the next lemma, we
make this intuition formal. We also include its proof for completeness.

Lemma 2.10 ([SS90]). Given $A, B \in \mathcal{S}_k(\mathbb{R}^n)$: $\|AA^T - BB^T\|_2 = \|A^{\perp} B\|_2$.

Proof. We will prove $\le$ by upper bounding the spectral norm of $(AA^T - BB^T)^2$. Since $AA^T = A^{\Pi}$
and $BB^T = B^{\Pi}$:

$$(AA^T - BB^T)^2 = A^{\Pi} + B^{\Pi} - A^{\Pi} B^{\Pi} - B^{\Pi} A^{\Pi} = A^{\Pi} B^{\perp} + B^{\Pi} A^{\perp} . \qquad (2)$$

If this matrix is zero, then our claim is trivially true. Suppose not. Consider the largest eigenvalue
$\sigma$ of eq. (2) and a corresponding eigenvector $q$. We have $0 \ne \sigma q = (A^{\Pi} B^{\perp} + B^{\Pi} A^{\perp}) q$, which means
either $B^{\perp} q \ne 0$ or $A^{\perp} q \ne 0$ (or both). Without loss of generality, we may assume $A^{\perp} q \ne 0$:

$$\sigma q = (A^{\Pi} B^{\perp} + B^{\Pi} A^{\perp}) q \implies \sigma A^{\perp} q = A^{\perp} B^{\Pi} A^{\perp} q .$$

Consequently, $q' \overset{\mathrm{def}}{=} A^{\perp} q$ is an eigenvector of $A^{\perp} B^{\Pi} A^{\perp}$ with eigenvalue $\sigma$:

$$A^{\perp} B^{\Pi} A^{\perp} q' = A^{\perp} B^{\Pi} A^{\perp} q = \sigma A^{\perp} q = \sigma q' .$$

In particular, $\|AA^T - BB^T\|_2^2 = \|(AA^T - BB^T)^2\|_2 = \sigma \le \|A^{\perp} B^{\Pi} A^{\perp}\|_2 = \|A^{\perp} B B^T A^{\perp}\|_2 =
\|A^{\perp} B\|_2^2$. Now we will prove $\ge$. If we multiply both sides of eq. (2) with $A^{\perp}$, we see that

$$(AA^T - BB^T)^2 \succeq A^{\perp} (AA^T - BB^T)^2 A^{\perp} = A^{\perp} B^{\Pi} A^{\perp} ,$$

which implies

$$\|(AA^T - BB^T)^2\|_2 \ge \|A^{\perp} B\|_2^2 .$$

□

2.1 Graph Partitioning


Given an undirected graph $G = (V, C)$ with nodes $V$ and non-negative edge weights $C$, we use
$A_G \in \mathbb{S}^n$ and $L_G \in \mathbb{S}^n_+$ to denote the adjacency and Laplacian matrices of $G$. Consider the following
$k$-way graph partitioning problem where the goal is to minimize the maximum ratio of the total
weight of edges cut and the number of nodes inside among all clusters.

Definition 2.11 ($k$-EXPANSION). Given an undirected graph $G = (V, C)$ with nodes $V$ and non-negative
edge weights $C$, we define the $k$-way expansion of $G$ as the following:

$$\phi_k(G) \overset{\mathrm{def}}{=} \min_{\Gamma \in \mathrm{Disj}_V(k)} \max_{T \in \Gamma} \frac{C(T, \overline{T})}{|T|} .$$

Here $C(A, B)$ denotes the total weight of unordered edges between $A$ and $B$. For fixed $G$, we will use
$\Gamma^* \in \mathrm{Disj}_V(k)$ to refer to the $k$-partition which achieves $\phi_k(G)$.

At first glance, our notion of expansion might seem different from the usual definition given
in eq. (1). However they are indeed the same:

Proposition 2.12. For any $G = (V, C)$ and a $k$-partition of $V$, $\Gamma \in \mathrm{Disj}_V(k)$,

$$\max_{T \in \Gamma} \frac{C(T, \overline{T})}{|T|} = \max_{S \in \Gamma} \frac{C(S, \overline{S})}{\min(|S|, |\overline{S}|)} .$$

In particular, $\phi_k(G) = \Phi_k(G)$.

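Proof. For any $S$, $\min(|S|, |\overline{S}|) \le |S|$, therefore $\frac{C(S, \overline{S})}{|S|} \le \frac{C(S, \overline{S})}{\min(|S|, |\overline{S}|)}$. Now we will prove the other
direction. Let $\phi \overset{\mathrm{def}}{=} \max_{T \in \Gamma} \frac{C(T, \overline{T})}{|T|}$. For any $T' \in \Gamma$:

$$C(T', \overline{T'}) \le \sum_{T \in \Gamma \setminus T'} C(T, \overline{T}) \le \phi \sum_{T \in \Gamma \setminus T'} |T| = \phi |\overline{T'}| .$$

Consequently, $\frac{C(T', \overline{T'})}{|\overline{T'}|} \le \phi$. Recall that $\frac{C(T', \overline{T'})}{|T'|} \le \phi$, so $\frac{C(T', \overline{T'})}{\min(|T'|, |\overline{T'}|)} \le \phi$.

□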
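
The equality in Proposition 2.12 is easy to check on a small example where one cluster contains more than half of the nodes. The sketch below assumes a graph given as a dictionary from edges to weights; the function names are hypothetical.

```python
def cut_weight(edges, S):
    """Total weight of edges with exactly one endpoint in S."""
    return sum(w for (u, v), w in edges.items() if (u in S) != (v in S))

def max_expansion(edges, partition):
    """max over T of C(T, complement of T) / |T|, as in Definition 2.11."""
    return max(cut_weight(edges, T) / len(T) for T in partition)

def max_expansion_min_side(edges, n, partition):
    """max over S of C(S, complement of S) / min(|S|, n - |S|), as in eq. (1)."""
    return max(cut_weight(edges, S) / min(len(S), n - len(S)) for S in partition)

# Unit-weight path 0-1-2-3-4-5, split into two end nodes and a large middle part.
edges = {(i, i + 1): 1.0 for i in range(5)}
partition = [{0}, {1, 2, 3, 4}, {5}]
lhs = max_expansion(edges, partition)
rhs = max_expansion_min_side(edges, 6, partition)
print(lhs, rhs)
```

For the large cluster the two ratios differ (2/4 versus 2/2), but the maximum over the whole partition is the same, as the proof predicts.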


We can capture the objective function of $\phi_k$ using the spectral norm, within a factor of 2:

Lemma 2.13. Given $\Gamma \in \mathrm{Disj}_V(k)$, $\frac{1}{2} \|\Gamma^T L \Gamma\|_2 \le \max_{T \in \Gamma} \frac{C(T, \overline{T})}{|T|} \le \|\Gamma^T L \Gamma\|_2$.

Proof. Let $\phi \overset{\mathrm{def}}{=} \max_{T \in \Gamma} \frac{C(T, \overline{T})}{|T|}$. We need to prove $\phi \le \sigma_{\max}(\Gamma^T L \Gamma) \le 2\phi$. The lower bound is
trivial, so we only give the proof of the upper bound. Note $\Gamma = J D^{-1/2}$ where $D$ is a diagonal matrix whose
diagonals are $|T|$ for $T \in \Gamma$ and the columns of $J$ are indicator vectors for every $T \in \Gamma$. Then
$\Gamma^T L \Gamma = D^{-1/2} J^T L J D^{-1/2}$. Define $W$ as the matrix which is equal to $J^T L J$ along its diagonals
and 0 everywhere else. Since $J^T L J$ is a Laplacian matrix, $J^T L J \preceq 2W$. Therefore:

$$\Gamma^T L \Gamma \preceq 2 D^{-1/2} W D^{-1/2} .$$

$D^{-1/2} W D^{-1/2}$ is diagonal with entries $\frac{e_T^T L e_T}{|T|} = \frac{C(T, \overline{T})}{|T|} \le \phi$ over all $T \in \Gamma$. Consequently,

$$\sigma_{\max}(\Gamma^T L \Gamma) \le 2 \sigma_{\max}(D^{-1/2} W D^{-1/2}) \le 2\phi .$$

□
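
A small numerical check of Lemma 2.13 for $k = 2$, where the spectral norm of the 2-by-2 matrix $\Gamma^T L \Gamma$ has a closed form. This is a sketch under assumptions: the graph, the helper names and the restriction to two clusters are all chosen for illustration.

```python
import math

def laplacian(n, edges):
    """Dense graph Laplacian from a dict {(u, v): weight}."""
    L = [[0.0] * n for _ in range(n)]
    for (u, v), w in edges.items():
        L[u][u] += w
        L[v][v] += w
        L[u][v] -= w
        L[v][u] -= w
    return L

def bilinear(L, x, y):
    return sum(x[i] * L[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

def lemma_2_13_quantities(n, edges, T1, T2):
    """Return (phi, ||Gamma^T L Gamma||_2) for the 2-partition {T1, T2}."""
    L = laplacian(n, edges)
    g1 = [1 / math.sqrt(len(T1)) if u in T1 else 0.0 for u in range(n)]
    g2 = [1 / math.sqrt(len(T2)) if u in T2 else 0.0 for u in range(n)]
    a, b, d = bilinear(L, g1, g1), bilinear(L, g1, g2), bilinear(L, g2, g2)
    phi = max(a, d)  # diagonal entries equal C(T, complement) / |T|
    # Gamma^T L Gamma is PSD, so its spectral norm is its top eigenvalue.
    norm = (a + d) / 2 + math.sqrt(((a - d) / 2) ** 2 + b * b)
    return phi, norm

edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 2.0, (3, 4): 1.0, (0, 4): 1.0}
phi, norm = lemma_2_13_quantities(5, edges, {0, 1, 2}, {3, 4})
print(phi, norm)
```

On this instance the maximum expansion is 1.5 and the spectral norm is 2.5, so both inequalities of the lemma hold strictly.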
Given Lemma 2.13, a simple relaxation for $\phi_k(G)$ (the basic SDP relaxation) is the following:

$$\min \|Q^T L Q\|_2 \quad \text{s.t.} \quad Q^T Q = I_k . \qquad (3)$$

Note that for any $\Gamma \in \mathrm{Disj}_V(k)$, $\Gamma$ is feasible and eq. (3) is indeed a relaxation. Moreover, the Courant-
Fischer-Weyl principle implies that the optimum value of eq. (3) is $\lambda_k$, with the corresponding
optimal solution being the smallest $k$ eigenvectors of $L$. Therefore:


$$\lambda_k \le 2\phi_k(G) . \qquad (4)$$


3 Our Contributions 


We re-state our main problem. Given a $k$-by-$n$ matrix $Y$ with $Y^T \in \mathcal{S}_k(\mathbb{R}^n)$ of the form $Y = [y_1, \ldots, y_n]$
(think of $Y$ as an embedding of $n$ points in $\mathbb{R}^k$), find a $k$-partition $\Gamma \in \mathrm{Disj}_V(k)$ so as to minimize
the total variance under any direction:

$$\min_{\Gamma} \max_{z \in \mathbb{R}^k : \|z\|_2 = 1} \sum_{S \in \Gamma} \sum_{u \in S} \langle z, y_u - c_u \rangle^2 . \qquad (5)$$


Here $c_u$ is the mean of points in the same cluster with $u$ if one exists, and $c_u = 0$ otherwise. This is
the problem of clustering with spectral norm [KK10].

Remark 3.1 (Covering all points). For simplicity, we allow some points to be left uncovered by any set in
$\Gamma$. However the same guarantees still hold even if we require $\Gamma$ to cover all points: We arbitrarily assign
uncovered points to clusters while making sure that the relative cluster sizes do not change. This procedure
changes the approximation ratio by a factor of 2.

We can express eq. (5) more succinctly as the following:

$$\min_{\Gamma} \|Y \Gamma^{\perp}\|_2^2 \overset{\text{Proposition 2.4}}{=} \text{eq. (5)} . \qquad (6)$$


There are two closely related problems, whose optimum is within a square root of eq. (5) (Lemma 2.10):

• Finding a $k$-by-$k$ rotation matrix $R$ with $R^T R = RR^T = I_k$ and a $k$-partition $\Gamma \in \mathrm{Disj}_V(k)$ so as to
minimize the following:

$$\min_{R, \Gamma} \|RY - \Gamma^T\|_2 . \qquad (7)$$

• Approximating the Gram matrix of $Y$, $Y^T Y$, using block diagonal matrices with each block
being constant. This is equivalent to:

$$\min_{\Gamma} \|Y^T Y - \Gamma \Gamma^T\|_2 . \qquad (8)$$

Our main contribution is a new spectral clustering algorithm whose pseudo-code is given through 
Algorithms 1 to 5. We prove the following guarantee on its outputs. 
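Theorem 3.2 (Restatement of Theorem 5.12). Let $\Gamma^* \in \mathrm{Disj}_V(k)$ with $\|Y \Gamma^{*\perp}\|_2^2 \le \varepsilon$. Then
$\widehat{\Gamma} \leftarrow$ SPECTRALCLUSTERING($Y$) is a $k$-partition so that $\widehat{\Gamma} \in \mathrm{Disj}_V(k)$, and it is $O(\sqrt{\varepsilon})$-close to both $\Gamma^*$
and $Y$:

$$\Delta(\Gamma^*, \widehat{\Gamma}) \le O(\sqrt{\varepsilon}) \quad \text{and} \quad \|Y \widehat{\Gamma}^{\perp}\|_2^2 \le O(\sqrt{\varepsilon}) .$$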

Remark 3.3 (Small Clusters). Our main guarantee as stated in Theorem 5.12 works for any cluster size.
For example, consider the case of some optimal cluster $T \in \Gamma^*$ having size $|T| \le O(1/\sqrt{\varepsilon})$. For such a cluster,
any $S$ with $\Delta(S, T) \le O(\sqrt{\varepsilon})$ has to be exactly equal to $T$.

In other words, our algorithm will recover any $T \in \Gamma^*$ with $|T| \le O(1/\sqrt{\varepsilon})$ exactly.

As an easy consequence, we show how to approximate a graph as a disjoint union of expanders 
(provided one exists) in polynomial time. 

Corollary 3.4 (Restatement of Corollary 6.4). Given a graph $G$, if there exists $\Gamma^* \in \mathrm{Disj}_V(k)$ such that the
Laplacian of $G$ is $\varepsilon$-close (in spectral norm) to the Laplacian corresponding to the disjoint union of normalized
cliques on each $T \in \Gamma^*$:

$$\|L - \Gamma^{*\perp}\|_2 \le \varepsilon ,$$

then in polynomial time, we can find $\widehat{\Gamma} \in \mathrm{Disj}_V(k)$ which is $O(\sqrt{\varepsilon})$-close to $\Gamma^*$ and $G$:

$$\|L - \widehat{\Gamma}^{\perp}\|_2 \le O(\varepsilon^{1/4}) .$$

Next we significantly improve the known bounds for recovering a $k$-partition when all clusters
have small expansion as in Definition 2.11. Previous spectral clustering algorithms only guarantee
recovering each $T \in \Gamma^*$ when the $(k+1)^{st}$ smallest eigenvalue, $\lambda_{k+1}$, of the associated Laplacian
matrix for $G$ satisfies

$$\lambda_{k+1} \ge \Omega(k \cdot \phi_k) .$$

Our new algorithm significantly relaxes this requirement to $\lambda_{k+1} \ge \Omega(\phi_k)$.

Theorem 3.5 (Restatement of Theorem 6.1). Given a graph $G$ with Laplacian matrix $L$, let $\widehat{\Gamma}$ be the
$k$-partition obtained by running Algorithm 3 on the smallest $k$ eigenvectors of $L$. Then:

Finally, we show that any approximation algorithm for $k$-EXPANSION implies the same approximation
bound for the spectral clustering problem restricted to orthonormal matrices. In other
words, the spectral clustering problem is a prerequisite for approximating $k$-EXPANSION even on
graphs whose normalized Laplacian matrix has its $(k+1)^{st}$ eigenvalue $\lambda_{k+1}$ larger than a constant.

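Theorem 3.6 (Restatement of Theorem 7.1). Given $Y$ with $Y^T \in \mathcal{S}_k(\mathbb{R}^n)$, let $\Gamma^* \overset{\mathrm{def}}{=} \operatorname{argmin}_{\Gamma} \|Y \Gamma^{\perp}\|_2$
with $\varepsilon = \|Y \Gamma^{*\perp}\|_2^2$. Then there exists a weighted, undirected, regular graph $X$, whose normalized Laplacian
matrix has its $(k+1)^{st}$ smallest eigenvalue at least $\lambda_{k+1} \ge 1 - O(\sqrt{\varepsilon})$ such that:

• Each $T \in \Gamma^*$ has small expansion, $\phi_X(T) \le O(\varepsilon)$;

• If $\Gamma \in \mathrm{Disj}_V(k)$ is a $k$-partition with $\max_{S \in \Gamma} \phi_X(S) \le \delta$, then $\Delta(\Gamma, \Gamma^*) \le O(\delta + \sqrt{\varepsilon})$.

Moreover such $X$ can be constructed in polynomial time.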
4 Our Algorithm

The pseudo-code of our clustering algorithm and its sub-procedures are listed in Algorithms 1 to 5.
The main procedure is invoked by $\widehat{\Gamma}$ = SPECTRALCLUSTERING($Y$) (Algorithm 3), where $Y$ is a
$k$-by-$n$ matrix such as the smallest $k$ eigenvectors of some Laplacian matrix. The output $\widehat{\Gamma}$ is a
$k$-partition close to $Y$. We use $\Gamma^* = (T_1, \ldots, T_k)$ to denote the closest $k$-partition to $Y$. We will refer
to the $T_i$'s as true clusters.

4.1 Intuition 

First we start with the discussion of some of the main challenges involved in spectral clustering 
and the intuition behind the major components of our algorithm. 

Finding a Cluster. Since there are $k$ directions and $k$ clusters, we can think of each direction being
associated with one of the clusters. Moreover, for one of the true clusters $T$, the total correlation of
its center with the remaining directions will be very small. By utilizing this intuition, we can easily
find such a subset, say $S$ (Algorithm 4). However this property need not be true for all $T$'s: Even
though each $T$ will be at most $\varepsilon$-close to every non-associated direction, the total correlation might
be $(k-1)\varepsilon \gg 1$, thus overwhelming the correlation it had with its associated direction. In fact,
this is the reason why $k$-means type procedures will fail to find every cluster when $\varepsilon > 1/k$. To
remedy this, each time we find $S$, we can try to "peel" it off. A natural approach is to project
the columns of $Y$ onto the orthogonal complement of the center of $S$. Similar ideas were used
before in the context of learning mixtures of anisotropic Gaussians [BV08] and column based matrix
reconstruction [DR10, GS12]. After projection, we obtain a new $(k-1)$-by-$n$ orthonormal matrix,
$Z$, corresponding to the remaining $k - 1$ clusters.

Boosting. Unfortunately we cannot iterate the above approach: No matter how accurate we are
in $S$, there will be some error: If $Y$ were $\varepsilon$-close to $(T_1, \ldots, T_k)$, then we can at best guarantee that
$Z$ is $2\varepsilon$-close to $(T_2, \ldots, T_k)$. After $k$ iterations, our error will be $2^{\Omega(k)} \varepsilon$, which is much worse than
$k$-means! In our algorithm, we keep the error from accumulating via a boosting step (Algorithm 1).

Unraveling. Even with boosting, there remains one issue: The clusters we found may overlap with
one another. Unlike other distance based clustering problems such as $k$-means, the assignment
problem ("which cluster does this node belong to?") is quite non-trivial in spectral clustering even
if we are given all $k$ centers. There is no simple local procedure which can figure out the assignment
of node $u$ by only looking at $y_u$ and the cluster centers. We deal with this issue by reducing the
ownership problem to finding a matching in a bipartite graph (Algorithm 5). Our approach is
very similar to the one used in [BS06] for a special case of the Santa Claus problem. Unfortunately,
this operation has a cascading effect: Adding a new cluster might considerably change the previous
clusters and their centers. Dealing with this challenge is what causes our final algorithm to be
rather involved.

Final Algorithm. The final difficulty we face is that, the boosted cluster might cannibalize other 
much smaller clusters. We overcome this issue by maintaining both estimates for every true cluster: 
a coarse estimate, which is the core we originally found; and the finer estimate obtained after 
boosting. Due to the cascading effect of unraveling whenever we add a new cluster, we have to 
re-compute the centers and project onto their orthogonal complement at every round. 
4.2 Overview


Our algorithm proceeds iteratively. At the $r^{th}$ iteration, it finds a core $S_r$ for one of the yet-unseen true
clusters, say $T_r$. The main invariant we need from the core set is the following (Lemma 5.7):

(i) $S_r$ is noticeably close to $T_r$, say $\Delta(S_r, T_r) \le$

(ii) All the remaining $k - r$ true clusters, $\{T_{r+1}, \ldots, T_k\}$, have small overlap with $S_r$ in the sense

that $|T_j \cap S_r| \le$ over all $j > r$.

After having found $S_r$, our algorithm needs to boost $S_r$ to $\widehat{S}_r$. For boosting to work however
(Theorem 5.8), we need the invariant (ii) to be true for all $j \ne r$. We do this by $(U_1, \ldots, U_r) \leftarrow$
UNRAVEL($\widehat{S}_1, \ldots, \widehat{S}_{r-1}, S_r$). Since the $U_j$'s are close to the $\widehat{S}_j$'s, each $T_j$ with $j < r$ will still be mostly
overlapping with $\widehat{S}_j$ (Theorem 5.6); hence the $T_j$'s with $j < r$ cannot overlap with $U_r$ as the $U_j$'s are disjoint.
Consequently we can use $U_r$ instead of $S_r$ for boosting so as to obtain $\widehat{S}_r$. The only invariant we
require from $\widehat{S}_r$ is that it is much closer to $T_r$ than $S_r$:

$$\Delta(\widehat{S}_r, T_r) \le O(\sqrt{\varepsilon}) .$$

By using the centers of boosted sets instead of core sets, we can make sure that the error does not
accumulate after the projection step $Z \leftarrow (Y \widehat{\Gamma}')^{\perp} Y$ (Lemma 5.5).

After $k$ iterations, we apply UNRAVEL to all boosted sets $(\widehat{S}_1, \ldots, \widehat{S}_k)$ one last time and output
the result.

Remark 4.1. In Algorithms 1 and 4, the last step involves computing the top singular vector. In both cases,
we have:

• A good initial guess (the indicator vector),

• Large separation between $\sigma_1$ and $\sigma_2$.

Thus, we can simply use the power method for $O(\log 1/\varepsilon)$ many iterations to compute a sufficiently accurate
approximation of the top right singular vector, from which we can obtain an approximation of the top left
singular vector easily. It was previously shown in [GKB13] that the power method is sufficient in the context of
spectral clustering.
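
The power method mentioned in Remark 4.1 can be sketched as follows. This is a generic textbook power iteration on $M^T M$ started from an indicator vector, written with plain lists; the function name, the toy matrix and the fixed iteration count are assumptions, and the paper's own iteration budget is $O(\log 1/\varepsilon)$.

```python
import math

def top_right_singular_vector(M, q0, iterations=50):
    """Power iteration on M^T M, started from the initial guess q0."""
    q = list(q0)
    for _ in range(iterations):
        Mq = [sum(row[j] * q[j] for j in range(len(q))) for row in M]          # M q
        q = [sum(M[i][j] * Mq[i] for i in range(len(M))) for j in range(len(q))]  # M^T (M q)
        norm = math.sqrt(sum(x * x for x in q))
        q = [x / norm for x in q]
    return q

# sigma_1^2 = 8 and sigma_2^2 = 1: a large gap, so convergence is fast.
M = [[2.0, 2.0, 0.0],
     [0.0, 0.0, 1.0]]
q = top_right_singular_vector(M, [1.0, 1.0, 1.0])   # start from an indicator vector
print(q)
```

With a large gap between the top two singular values, the component along the second singular direction shrinks geometrically, by the factor $(\sigma_2/\sigma_1)^2$ per iteration.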


5 Analysis of the Algorithm 

In this section, we prove the correctness of our algorithm. Our main result is the following. Its 
proof is given at the end of Section 5.5. 

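Theorem 5.1 (Restatement of Theorem 5.12). Let $\Gamma^* \in \mathrm{Disj}_V(k)$ with $\|Y \Gamma^{*\perp}\|_2^2 \le \varepsilon$. Then
$\widehat{\Gamma} \leftarrow$ SPECTRALCLUSTERING($Y$) is a $k$-partition so that $\widehat{\Gamma} \in \mathrm{Disj}_V(k)$, and it is $O(\sqrt{\varepsilon})$-close to both $\Gamma^*$
and $Y$:

$$\Delta(\Gamma^*, \widehat{\Gamma}) \le O(\sqrt{\varepsilon}) \quad \text{and} \quad \|Y \widehat{\Gamma}^{\perp}\|_2^2 \le O(\sqrt{\varepsilon}) .$$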

In order to keep the analysis simple, we make no effort toward optimizing the constants. 
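Algorithm 1 $\widehat{S}$ = BOOST($Y$, $S$).

1. $p \leftarrow$ top left singular vector of $Y_S$.

2. Return ROUND($Y^T p$).


Algorithm 2 $S$ = ROUND($q$).

1. $\mathcal{F} \leftarrow \{\{u \mid s q_u \ge s q_v\} \mid v \in V, s \in \{\pm 1\}\}$.

2. Return $\operatorname{argmax}_{S \in \mathcal{F}} |\langle q, \overline{e}_S \rangle|$.


Algorithm 3 $\widehat{\Gamma}$ = SPECTRALCLUSTERING($Y$).

1. For $r \leftarrow 1$ to rank($Y$) do:

(a) $\widehat{\Gamma}' \leftarrow$ UNRAVEL($\widehat{S}_1, \widehat{S}_2, \ldots, \widehat{S}_{r-1}$).

(b) $Z \leftarrow (Y \widehat{\Gamma}')^{\perp} Y$.

(c) $S_r \leftarrow$ FINDCLUSTER($Z$).

(d) $(U_1, \ldots, U_r) \leftarrow$ UNRAVEL($\widehat{S}_1, \ldots, \widehat{S}_{r-1}, S_r$).

(e) $\widehat{S}_r \leftarrow$ BOOST($Y$, $U_r$).

2. Return UNRAVEL($\widehat{S}_1, \ldots, \widehat{S}_k$).


Algorithm 4 $S$ = FINDCLUSTER($Y$).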

1. Choose $\delta \in [0,1]$ as the minimum for which the
following returns some $S$.

2. For each $c \in V$:

(a) 7T be an ordering st < 11 . 

(b) m <- min jj II^(i) II 2 >1-5'}. 

(c) If 11^0) - Y c\\ 2 < then: 

i. $S \leftarrow \{\pi(1), \ldots, \pi(m)\}$.

ii. $q \leftarrow$ top right singular vector of $Y_S$.

iii. Return ROUND($q$).
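
The rounding step (Algorithm 2, ROUND) admits a short direct implementation. The sketch below tries every threshold cut of $q$ in both sign directions and keeps the set whose normalized indicator correlates best with $q$; it assumes the reading $\{u \mid s q_u \ge s q_v\}$ with a non-strict inequality and a normalized indicator $\overline{e}_S$, and the quadratic-time loop is chosen for clarity rather than speed. The name `round_vector` is hypothetical.

```python
import math

def round_vector(q):
    """Return the threshold set S maximizing |<q, e_S>| / sqrt(|S|)."""
    best_set, best_score = None, -1.0
    n = len(q)
    for s in (1, -1):
        for v in range(n):
            # All coordinates at least as large as q[v] (or as small, for s = -1).
            S = [u for u in range(n) if s * q[u] >= s * q[v]]
            score = abs(sum(q[u] for u in S)) / math.sqrt(len(S))
            if score > best_score:
                best_set, best_score = S, score
    return best_set

# A vector that is nearly the normalized indicator of {0, 1, 2, 3}.
q = [0.5, 0.5, 0.5, 0.5, 0.01, -0.02]
cluster = round_vector(q)
print(cluster)
```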


Algorithm 5 $\widehat{\Gamma}$ = UNRAVEL($\Gamma$) (see Figure 1 for a sample graph construction).

1. Choose $\delta \in [0,1]$ as the minimum for which the following returns a matching.

2. Construct a bipartite graph $H = (L, R, E)$, where the left side $L$ is the node set $V$:

(a) For all $S \in \Gamma$, there is a block $B_S$ of $\lceil (1 - \delta)|S| \rceil$ identical nodes in $R$.

(b) For all $S \in \Gamma$ and $u \in S$, there is an edge between $u$ and all nodes in $B_S$.

3. Find a matching that covers $R$.

4. For all $S \in \Gamma$, $\widehat{S} \leftarrow$ nodes matched to the block $B_S$. Return $(\widehat{S} \mid S \in \Gamma)$.


[Figure 1 panels: "Input $\Gamma = \{S_1, S_2, S_3, S_4\}$" and "Bipartite Graph".]

Figure 1: The graph constructed by UNRAVEL on input $S_1 = \{a, b, c, e\}$, $S_2 = \{d, e, f, g\}$,
$S_3 = \{g, h, i, j, k\}$, $S_4 = \{k\}$ for a fixed $\delta$.
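
The matching-based UNRAVEL step can be sketched with a standard augmenting-path bipartite matching (Kuhn's algorithm). This is an illustrative sketch, not the paper's implementation: the function names are hypothetical, and the search over $\delta$ simply tries the finitely many values $1 - j/|S|$ at which some block size $\lceil (1-\delta)|S| \rceil$ changes, in increasing order.

```python
import math
from fractions import Fraction

def unravel_fixed(sets, delta):
    """Disjoint subsets with |S_hat| = ceil((1 - delta)|S|), or None if impossible."""
    slots = []  # one entry per right-side node, labelled by the index of its set
    for i, S in enumerate(sets):
        slots += [i] * math.ceil((1 - delta) * len(S))
    owner = {}  # left node u -> index of the slot matched to u

    def augment(slot, seen):
        for u in sets[slots[slot]]:
            if u in seen:
                continue
            seen.add(u)
            if u not in owner or augment(owner[u], seen):
                owner[u] = slot
                return True
        return False

    for slot in range(len(slots)):
        if not augment(slot, set()):
            return None  # no matching covers the right side
    result = [set() for _ in sets]
    for u, slot in owner.items():
        result[slots[slot]].add(u)
    return result

def unravel(sets):
    """Smallest feasible delta among the breakpoints, with its disjoint subsets."""
    candidates = sorted({Fraction(len(S) - j, len(S))
                         for S in sets for j in range(len(S) + 1)})
    for delta in candidates:
        out = unravel_fixed(sets, delta)
        if out is not None:
            return delta, out

sets = [["a", "b", "c", "e"], ["d", "e", "f", "g"], ["g", "h", "i", "j", "k"], ["k"]]
delta, disjoint = unravel(sets)
print(delta, disjoint)
```

On the four sets listed in the caption of Figure 1, this sketch returns $\delta = 1/4$: each overlapping node is given to exactly one set, and every set keeps at least three quarters of its nodes.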


5.1 Preliminaries 

In the following proposition, we show that for any pair of symmetric matrices that are close to each other in spectral norm, if there is a gap between the largest and second largest eigenvalues, then the top eigenvectors of both matrices will be very close to each other. This can also be obtained using Wedin's theorem [SS90], but we chose to give a simple and self-contained proof.
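Proposition 5.2. Given $A, B \in \mathbb{S}^n$ with maximum eigenvectors $p, q \in \mathbb{R}^n$:

$$\langle p, q \rangle^2 \ge 1 - \frac{2\|A - B\|_2}{\sigma_1(A) - \sigma_2(A)}.$$

Proof. Suppose $\|p\| = \|q\| = 1$. Let $\delta := \|A - B\|_2$ and $\theta := \langle p, q \rangle^2$. We have

$$\sigma_1(A) = p^T A p \le \delta + p^T B p \le \delta + \sigma_1(B)\langle q, p \rangle^2 + \sigma_2(B)\|p^{\perp} q\|^2$$
$$\le \delta + (\sigma_1(A) + \delta)\theta + (\sigma_2(A) + \delta)(1 - \theta)$$
$$= \theta(\sigma_1 - \sigma_2) + (\sigma_2 + 2\delta).$$

Hence $\theta \ge \frac{\sigma_1 - \sigma_2 - 2\delta}{\sigma_1 - \sigma_2}$. □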
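As a numerical sanity check of Proposition 5.2, the following NumPy sketch builds a symmetric matrix with a spectral gap, perturbs it by a symmetric matrix of known spectral norm, and evaluates both sides of the bound. The helper name `top_eigvec_overlap_bound` and the test matrices are illustrative assumptions, not part of the paper.

```python
import numpy as np

def top_eigvec_overlap_bound(A, B):
    """Return (<p,q>^2, 1 - 2*||A-B||_2 / (sigma_1(A) - sigma_2(A))) for symmetric A, B."""
    wA, VA = np.linalg.eigh(A)          # eigenvalues in ascending order
    wB, VB = np.linalg.eigh(B)
    p, q = VA[:, -1], VB[:, -1]         # unit eigenvectors of the largest eigenvalues
    gap = wA[-1] - wA[-2]               # sigma_1(A) - sigma_2(A)
    delta = np.linalg.norm(A - B, 2)    # spectral norm of the perturbation
    return float(p @ q) ** 2, 1.0 - 2.0 * delta / gap

rng = np.random.default_rng(0)
n = 8
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag([1.0, 0.3, 0.2, 0.1, 0.0, 0.0, 0.0, 0.0]) @ Q.T
E = rng.standard_normal((n, n))
E = 0.02 * (E + E.T) / np.linalg.norm(E + E.T, 2)   # symmetric, spectral norm 0.02
B = A + E
overlap, bound = top_eigvec_overlap_bound(A, B)
print(f"<p,q>^2 = {overlap:.6f}, lower bound = {bound:.6f}")
```

The squared overlap is insensitive to the arbitrary signs that `eigh` assigns to eigenvectors, which is why the bound is stated for $\langle p, q \rangle^2$.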

The main tool we use to identify the clusters will be the eigenvalues of principal minors of the Gram matrix, $Y^T Y$. Basically, the eigenvalues measure how much the true clusters, $\Gamma$, overlap with a given principal minor. We make this connection formal in the following claim:


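Claim 5.3. Given $Y \in \mathbb{R}^{m \times n}$, $\Gamma \in \mathrm{Disj}_V(r)$ and subset $S \subseteq V$, let $\rho := (|S \cap T| / |T| \mid T \in \Gamma)$. Then for any $i$:

$$|\sigma_i^2(Y_S) - (\rho)_i^{\downarrow}| \le \|Y^T Y - \Gamma^{\Pi}\|_2.$$

Here $(\rho)_i^{\downarrow}$ is the $i$-th largest element of $\rho$.

Proof. We have $\|Y^T Y - \Gamma^{\Pi}\|_2 \ge \|Y_S^T Y_S - (\Gamma^{\Pi})_{S,S}\|_2$. Observe that the eigenvectors of $(\Gamma^{\Pi})_{S,S}$ are $\overline{e}_{T' \cap S}$ with corresponding eigenvalue $|T' \cap S| / |T'|$ over all $T' \in \Gamma$. Thus the $\rho$'s are the eigenvalues of $(\Gamma^{\Pi})_{S,S}$. Since the $\sigma_i^2(Y_S)$ are the eigenvalues of $Y_S^T Y_S$, the claim follows from Weyl's inequality. □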
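To make Claim 5.3 concrete, the sketch below compares the squared singular values of $Y_S$ with the sorted overlap ratios $\rho$ for a small hand-made partition. The function name `claim_5_3_gap` and the toy data are assumptions chosen for illustration; the matrix $\Gamma^{\Pi}$ is built as the projection onto the span of the cluster indicators.

```python
import numpy as np

def claim_5_3_gap(Y, clusters, S):
    """Return (max_i |sigma_i^2(Y_S) - rho_i|, ||Y^T Y - Gamma^Pi||_2)."""
    n = Y.shape[1]
    proj = np.zeros((n, n))                      # Gamma^Pi
    for T in clusters:
        idx = np.array(sorted(T))
        proj[np.ix_(idx, idx)] = 1.0 / len(T)
    S = sorted(S)
    rho = sorted((len(set(S) & set(T)) / len(T) for T in clusters), reverse=True)
    sig2 = np.linalg.svd(Y[:, S], compute_uv=False) ** 2   # descending order
    m = min(len(rho), len(S))
    sig2 = np.pad(sig2, (0, max(0, m - len(sig2))))[:m]    # pad with zero eigenvalues
    lhs = float(np.max(np.abs(sig2 - np.array(rho[:m]))))
    rhs = float(np.linalg.norm(Y.T @ Y - proj, 2))
    return lhs, rhs

clusters = [{0, 1, 2, 3}, {4, 5, 6}, {7, 8, 9, 10, 11}]
S = {0, 1, 2, 4, 7}
# Exact case: rows of Y0 are the normalized cluster indicators.
Y0 = np.array([[1 / np.sqrt(len(T)) if u in T else 0.0 for u in range(12)]
               for T in clusters])
lhs0, rhs0 = claim_5_3_gap(Y0, clusters, S)
# Arbitrary case: the inequality holds for any matrix Y.
Y = np.random.default_rng(1).standard_normal((3, 12)) / 3.0
lhs, rhs = claim_5_3_gap(Y, clusters, S)
```

In the exact case both sides vanish, because $Y_S Y_S^T$ is then diagonal with entries $|S \cap T|/|T|$.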


Consider a principal minor corresponding to some $S$ whose largest eigenvalue is large, and second largest eigenvalue is small. The previous claim implies that there is a unique optimal cluster $T$ which is almost contained in $S$. However this is still not sufficient: $S$ might be much larger than $T$. In the next lemma, we show that one can take the top right singular vector of $Y_S$ and round (threshold) it to obtain another subset $\hat S \subseteq S$ which is now very close to $T$.

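Lemma 5.4 (Initial Guess). Given $Y \in \mathbb{R}^{m \times n}$, $\Gamma^* \in \mathrm{Disj}_V(r)$ with $\|Y^T Y - (\Gamma^*)^{\Pi}\|_2 \le \delta$, and subset $S \subseteq V$, let $q \in \mathbb{R}^S$ be the top right singular vector of $Y_S$. If we define $\sigma_1 := \sigma_1^2(Y_S)$ and $\sigma_2 := \sigma_2^2(Y_S)$, then the subset $\hat S \subseteq S$ obtained by:

$$\hat S \leftarrow \mathrm{ROUND}(q)$$

satisfies the following. For $T$ being $\arg\max_{T \in \Gamma^*} \frac{|T \cap S|}{|T|}$:

$$\frac{|T \cap \hat S|}{\sqrt{|T||\hat S|}} \ge \sqrt{\sigma_1 - \delta}\left(1 - \frac{4\delta}{\sigma_1 - \sigma_2}\right)$$

and

$$\forall T' \ne T \in \Gamma^* : \ |T' \cap \hat S| \le (\sigma_2 + \delta)|T'|.$$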
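Proof. We will use $\sigma_1 = \sigma_1^2(Y_S)$, $\sigma_2 = \sigma_2^2(Y_S)$ and $\Delta = \sigma_1 - \sigma_2$. Since $q$ is the top right singular vector of $Y_S$, $\|Y_S q\|^2 = \sigma_1$ and $\|q\| = 1$.

By Claim 5.3, $\left| \frac{|T \cap S|}{|T|} - \sigma_1 \right| \le \delta$, which implies $\frac{|T \cap S|}{|T|} \ge \sigma_1 - \delta$. For any $T' \ne T$, $\frac{|T' \cap S|}{|T'|} \le \sigma_2 + \delta$. Via Proposition 5.2, we see that:

$$\langle q, \overline{e}_{T \cap S} \rangle^2 \ge 1 - \frac{2\delta}{\sigma_1 - \sigma_2} = 1 - \frac{2\delta}{\Delta} \implies \langle q, \overline{e}_{\hat S} \rangle^2 \ge 1 - \frac{2\delta}{\Delta}.$$

Provided that $\delta$ is a sufficiently small fraction of $\Delta$, both $\langle q, \overline{e}_{T \cap S} \rangle$ and $\langle q, \overline{e}_{\hat S} \rangle$ have the same sign. Therefore:

$$\langle \overline{e}_{T \cap S}, \overline{e}_{\hat S} \rangle \ge \langle q, \overline{e}_{T \cap S} \rangle \langle q, \overline{e}_{\hat S} \rangle - \|q^{\perp} \overline{e}_{T \cap S}\| \, \|q^{\perp} \overline{e}_{\hat S}\| \ge 1 - \frac{4\delta}{\Delta}.$$

Consequently, using the fact $\hat S \subseteq S$, we see that any $T' \ne T$ has $|T' \cap \hat S| \le (\sigma_2 + \delta)|T'|$ and $T \cap \hat S \cap S = T \cap \hat S$:

$$1 - \frac{4\delta}{\Delta} \le \frac{|T \cap \hat S|}{\sqrt{|T \cap S||\hat S|}} \le \frac{1}{\sqrt{\sigma_1 - \delta}} \cdot \frac{|T \cap \hat S|}{\sqrt{|T||\hat S|}}.$$

□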


After we found new clusters, we iterate by projecting Y onto the orthogonal complement of 
their center. In the following lemma, we prove that, as long as the clusters were close to optimal 
ones, the projection preserves remaining clusters. 

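Lemma 5.5. Given $Y : Y^T \in S_k(\mathbb{R}^n)$ and $\Gamma \in \mathrm{Disj}_V(r)$ with the cluster centers in $\Gamma$ being linearly independent, suppose there exists $\Gamma^* \in \mathrm{Disj}_V(k)$ of the form $\Gamma^* = \Gamma^{*\prime} \uplus \Gamma^{*\prime\prime}$: $\Gamma^{*\prime} \in \mathrm{Disj}_V(r)$, $\Gamma^{*\prime\prime} \in \mathrm{Disj}_V(k - r)$ such that:

• $\|\Gamma - \Gamma^{*\prime}\|_2^2 \le \alpha$.

• $\|Y(\Gamma^*)^{\perp}\|_2^2 \le \varepsilon$.

Then $\Gamma^{*\prime\prime}$ is a good spectral clustering for $Z := (Y\Gamma)^{\perp} Y$, in the sense that $\|Z(\Gamma^{*\prime\prime})^{\perp}\|_2^2 \le \varepsilon + \alpha$. In addition, $\sigma_1(Z) = \dots = \sigma_{k-r}(Z) = 1$, $\sigma_{k-r+1}(Z) = 0$.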

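Proof. Note that $Z$ has all singular values either 0 or 1:

$$ZZ^T = (Y\Gamma)^{\perp} YY^T (Y\Gamma)^{\perp} = (Y\Gamma)^{\perp}.$$

Moreover $(Y\Gamma)^{\perp}$ has rank $\mathrm{rank}(Y) - \mathrm{rank}(Y\Gamma) = k - r$. Since spectral clustering is invariant under change of basis, we can assume $Z \in \mathbb{R}^{(k-r) \times n}$ so that all singular values of $Z$ are 1. This means $Z$ is orthonormal. It is obtained from $Y$ by a linear transformation, therefore $\Gamma^*$ is a good spectral clustering for $Z$ also:

$$\varepsilon \ge \|Y(\Gamma^*)^{\perp}Y^T\|_2 \ge \|Z(\Gamma^*)^{\perp}Z^T\|_2 = \|ZZ^T - (Z\Gamma^*)(Z\Gamma^*)^T\|_2 = \|I_{k-r} - (Z\Gamma^*)(Z\Gamma^*)^T\|_2.$$

In other words,

$$(1 - \varepsilon) I_{k-r} \preceq Z(\Gamma^*)^{\Pi}Z^T = Z\left[(\Gamma^{*\prime})^{\Pi} + (\Gamma^{*\prime\prime})^{\Pi}\right]Z^T \implies \varepsilon I_{k-r} \succeq Z(\Gamma^{*\prime\prime})^{\perp}Z^T - Z(\Gamma^{*\prime})^{\Pi}Z^T.$$

Now we will upper bound $\|Z(\Gamma^{*\prime})^{\Pi}Z^T\|_2 = \|Z\Gamma^{*\prime}\|_2^2 = \|(Y\Gamma)^{\perp}(Y\Gamma^{*\prime})\|_2^2$ by a simple Cauchy-Schwarz:

$$\|(Y\Gamma)^{\perp}(Y\Gamma^{*\prime})\|_2^2 = \|(Y\Gamma)^{\perp}(Y\Gamma^{*\prime}) - (Y\Gamma)^{\perp}(Y\Gamma)\|_2^2 = \|(Y\Gamma)^{\perp}Y(\Gamma^{*\prime} - \Gamma)\|_2^2$$
$$\le \|(Y\Gamma)^{\perp}Y\|_2^2 \, \|\Gamma^{*\prime} - \Gamma\|_2^2 \le \|Y\|_2^2 \, \|\Gamma^{*\prime} - \Gamma\|_2^2 \le \|\Gamma^{*\prime} - \Gamma\|_2^2 \le \alpha.$$

As a consequence, $\|Z(\Gamma^{*\prime\prime})^{\perp}\|_2^2 \le \varepsilon + \|Z\Gamma^{*\prime}\|_2^2 \le \varepsilon + \alpha$. □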
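The singular-value statement of Lemma 5.5 is easy to check numerically. The sketch below forms $Z = (Y\Gamma)^{\perp}Y$ for a random $Y$ with orthonormal rows and two already-found clusters; the helper name `project_out` and the random instance are assumptions for illustration, and the projector is computed with a pseudoinverse rather than any method prescribed by the paper.

```python
import numpy as np

def project_out(Y, found):
    """Z = (Y Gamma)^perp Y, where Gamma holds the normalized indicators of `found`."""
    k, n = Y.shape
    G = np.zeros((n, len(found)))
    for j, T in enumerate(found):
        G[sorted(T), j] = 1.0 / np.sqrt(len(T))
    M = Y @ G                                  # k x r matrix of (scaled) cluster centers
    P = np.eye(k) - M @ np.linalg.pinv(M)      # projector onto the complement of range(M)
    return P @ Y

rng = np.random.default_rng(2)
k, n = 4, 12
Y = np.linalg.qr(rng.standard_normal((n, k)))[0].T   # k x n, orthonormal rows
Z = project_out(Y, [{0, 1, 2}, {3, 4, 5, 6}])
singular_values = np.linalg.svd(Z, compute_uv=False)
print(np.round(singular_values, 6))
```

With $k = 4$ and $r = 2$ linearly independent centers, $Z$ has $k - r = 2$ singular values equal to 1 and the rest equal to 0, matching $ZZ^T = (Y\Gamma)^{\perp}$.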

Now we will start with the proof of correctness for our unraveling procedure. 


5.2 Correctness of UNRAVEL (Algorithm 5) 

We will now prove that if the input of UNRAVEL($\Gamma$) (Algorithm 5) is a list of (possibly overlapping) sets which are close to some $k$-partition (ground truth), then the output will be a list of $k$ disjoint sets which are also close to the ground truth. Our algorithm is based on formulating this as a simple maximum bipartite matching problem.
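The matching formulation can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: it takes $\delta$ as a fixed parameter instead of searching for the smallest feasible value, and it uses Kuhn's augmenting-path algorithm as one standard way to find a matching that covers all blocks. The function name `unravel` and the sample input are assumptions.

```python
from math import ceil

def unravel(sets, delta):
    """Return disjoint U_i subset of sets[i] with |U_i| = ceil((1-delta)|S_i|), or None."""
    # Right side: one block of ceil((1 - delta) * |S|) identical slots per input set.
    slots = [i for i, S in enumerate(sets) for _ in range(ceil((1 - delta) * len(S)))]
    owner = {}                       # node -> index of the slot it is matched to

    def augment(slot, seen):
        # Try to match `slot` to a node of its set, re-routing earlier matches if needed.
        for u in sets[slots[slot]]:
            if u in seen:
                continue
            seen.add(u)
            if u not in owner or augment(owner[u], seen):
                owner[u] = slot
                return True
        return False

    for slot in range(len(slots)):
        if not augment(slot, set()):
            return None              # no matching covers every block
    out = [set() for _ in sets]
    for u, slot in owner.items():
        out[slots[slot]].add(u)
    return out

sets = [{"a", "b", "c", "e"}, {"d", "e", "f", "g"}, {"g", "h", "i", "j", "k"}, {"k"}]
result = unravel(sets, delta=0.25)
print(result)
```

Because all slots in a block have the same neighbors, it does not matter which particular slot of a block a node is matched to; only the block sizes constrain the output.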

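Lemma 5.6. Given $\Gamma \in \mathrm{Set}_V(k)$, if there exists $\Gamma^* \in \mathrm{Disj}_V(k)$ which is $\delta$-close to $\Gamma$, then UNRAVEL($\Gamma$) (Algorithm 5) will output $\hat\Gamma \in \mathrm{Disj}_V(k)$ such that:

• For each $S \in \Gamma$, there exists $U \in \hat\Gamma$ with $U \subseteq S$ and $|U| \ge (1 - \delta)|S|$.

• $\hat\Gamma$ is $4\delta$-close to $\Gamma^*$.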


Proof. It is easy to see that if all blocks are matched, then the resulting assignment is a collection of $k$ disjoint subsets so it has the first property. For the second property, consider any $S \in \Gamma$ with corresponding subsets $U \in \hat\Gamma$ and $T \in \Gamma^*$.

$$\Delta(U, S) = 1 - \frac{|S \cap U|^2}{|S||U|} = \frac{|S \setminus U|}{|S|} \le \delta.$$

Hence $\sqrt{\Delta(U, T)} \le \sqrt{\Delta(U, S)} + \sqrt{\Delta(S, T)} \le 2\sqrt{\delta}$. This implies $\hat\Gamma$ is $4\delta$-close to $\Gamma^*$.

Now we will prove that if a $\delta$-close $\Gamma^* \in \mathrm{Disj}_V(k)$ exists, then there is always a matching of all blocks. Suppose $\pi : \Gamma \leftrightarrow \Gamma^*$ is a matching which minimizes $\max_S \Delta(S, \pi(S))$. Then:

$$|S \cap \pi(S)|^2 \ge (1 - \delta)|S||\pi(S)| \implies |S \cap \pi(S)| \ge (1 - \delta)|S|.$$

By Hall's theorem, we have to show that for any set of right nodes, $B \subseteq R$, $B$'s neighbors on the left, $N(B)$, are at least as many as $B$: $|N(B)| \ge |B|$. Observe that if $B$ contains some nodes of block $B_S$, then adding the whole block $B_S$ to $B$ does not increase $|N(B)|$, because all nodes in $B_S$ have the same set of neighbors. Hence the only subsets we need to consider are of the form $B = \bigcup_{S \in A} B_S$ over all $A \subseteq \Gamma$:

$$|N(B)| = \left|\bigcup_{S \in A} S\right| \ge \left|\bigcup_{S \in A} (S \cap \pi(S))\right| = \sum_{S \in A} |S \cap \pi(S)| \ge \sum_{S \in A} \lceil (1 - \delta)|S| \rceil = |B|,$$

where the last inequality uses that $|S \cap \pi(S)|$ is an integer.

□


5.3 Correctness of FindCluster (Algorithm 4) 

Here we prove that, given an orthonormal basis $Y$, if it is close to a $k$-partition up to rotation, then FINDCLUSTER($Y$) (Algorithm 4) will output a set which is close to one of the clusters in this $k$-partition.
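A rough NumPy sketch of FINDCLUSTER follows. Several choices here are assumptions rather than the paper's specification: nodes are ordered by the relative distance $\|Y_u - Y_c\|^2 / \|Y_u\|^2$ (the quantity used in the proof of Lemma 5.7), the minimum $\delta'$ is searched over a fixed grid, and the names `find_cluster` and `round_vector` are hypothetical.

```python
import numpy as np

def round_vector(q):
    """Threshold rounding: best set {u : s*q_u >= s*q_v} by |sum of q on the set| / sqrt(size)."""
    best, best_score = None, -1.0
    for s in (1.0, -1.0):
        for v in range(len(q)):
            mask = s * q >= s * q[v]
            score = abs(q[mask].sum()) / np.sqrt(mask.sum())
            if score > best_score:
                best, best_score = np.flatnonzero(mask), score
    return best

def find_cluster(Y, deltas=np.linspace(0.01, 1.0, 100)):
    norms2 = (Y ** 2).sum(axis=0)                      # ||Y_u||^2 for every node u
    for dp in deltas:                                  # smallest grid value that succeeds
        for c in range(Y.shape[1]):
            dist2 = ((Y - Y[:, [c]]) ** 2).sum(axis=0)
            rho = np.where(norms2 > 0, dist2 / np.maximum(norms2, 1e-300), np.inf)
            order = np.argsort(rho, kind="stable")
            mass = np.cumsum(norms2[order])
            hit = np.flatnonzero(mass >= 1 - dp)
            if hit.size == 0:
                continue
            S = order[: hit[0] + 1]                    # shortest prefix with enough mass
            if dist2[S].sum() <= dp:
                q = np.linalg.svd(Y[:, S])[2][0]       # top right singular vector of Y_S
                return set(S[round_vector(q)].tolist())
    return None

# Toy instance: a rotated basis of the span of three normalized cluster indicators.
clusters = [{0, 1, 2, 3}, {4, 5, 6}, {7, 8, 9, 10, 11}]
Y0 = np.array([[1 / np.sqrt(len(T)) if u in T else 0.0 for u in range(12)]
               for T in clusters])
Q = np.linalg.qr(np.random.default_rng(5).standard_normal((3, 3)))[0]
found = find_cluster(Q @ Y0)
print(found)
```

On this noiseless instance the columns of a cluster coincide, so the first center tried already yields a prefix with zero spread and the cluster is recovered despite the unknown rotation `Q`.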

Lemma 5.7. Given $Y \in \mathbb{R}^{m \times n}$, whose singular values are 0 or 1, if there exists $\Gamma^* \in \mathrm{Disj}_V(k)$ with $\|Y^T Y - (\Gamma^*)^{\Pi}\|_2 \le \delta$ for some small enough constant $\delta$, then FINDCLUSTER($Y$) (Algorithm 4) will output $\hat S \subseteq V$ such that:

• There exists $T \in \Gamma^*$ with $\Delta(\hat S, T) \le O(\sqrt{\delta})$,

• Any $T' \ne T : T' \in \Gamma^*$ has $|T' \cap \hat S| \le O(\sqrt{\delta})|T'|$.

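Proof. $\|Y^T Y - (\Gamma^*)^{\Pi}\|_2 \le \delta$ implies $\sigma_k(Y) = 1$ and $\sigma_{k+1}(Y) = 0$. Therefore $\|Y(\Gamma^*)^{\perp}\|_F^2 \le \delta k$. In other words,

$$\sum_{T \in \Gamma^*} \left( \|\overline{e}_T\|^2 - \|Y\overline{e}_T\|^2 \right) = \sum_{T \in \Gamma^*} \left( 1 - \|Y\overline{e}_T\|^2 \right) \le \delta k.$$

So:

$$\sum_{T \in \Gamma^*} \left[ \max(1, \|Y_T\|_F^2) - \|Y\overline{e}_T\|^2 \right] \le 2\delta k.$$

As a consequence, there exists $T \in \Gamma^*$ and some $c \in T$ such that:

$$\|Y_T\|_F^2 \ge \|Y\overline{e}_T\|^2 \ge 1 - 2\delta,$$

$$4\delta \ge \frac{1}{|T|} \sum_{u \in T, v \in T} \|Y_u - Y_v\|^2 \ge \sum_{u \in T} \|Y_u - Y_c\|^2 \implies \frac{4\delta}{1 - 2\delta} \ge \mathbb{E}_{u \sim \|Y_u\|^2} \left[ \frac{\|Y_u - Y_c\|^2}{\|Y_u\|^2} \right].$$

Let's define $\rho_u := \frac{\|Y_u - Y_c\|^2}{\|Y_u\|^2}$ and sort the nodes in ascending order so that $\rho_1 \le \rho_2 \le \dots \le \rho_n$:

$$\mathbb{E}_{u \sim \|Y_u\|^2}[\rho_u] \le \frac{4\delta}{1 - 2\delta} \le 9\delta \quad \text{provided } \delta \text{ is small enough.}$$

By a simple Markov inequality, the sum of all $\|Y_u\|^2$ over $u \in T$ with $\rho_u \le 3\sqrt{\delta}$ is at least $1 - 3\sqrt{\delta}$. Consequently, the smallest integer $m$ for which $\sum_{1 \le u \le m} \|Y_u\|^2 \ge 1 - 3\sqrt{\delta}$ satisfies $\sum_{1 \le u \le m} \|Y_u - Y_c\|^2 \le 3\sqrt{\delta}$.

From now on, we assume $S$ is a subset and $\delta' : \delta' \le 3\sqrt{\delta}$ with:

$$\|Y_S\|_F^2 = \sum_j \sigma_j^2(Y_S) \ge 1 - \delta' \quad \text{and} \quad \delta' \ge \sum_{u \in S} \|Y_u - Y_c\|^2.$$

Recall that variance is lower bounded by the sum of the squares of all but the largest singular values:

$$\sum_{u \in S} \|Y_u - Y_c\|^2 \ge \|Y_S\|_F^2 - \sigma_1^2(Y_S) = \sum_{j \ge 2} \sigma_j^2(Y_S).$$

Hence $\sigma_1^2(Y_S) \ge 1 - 2\delta'$ and $\sigma_2^2(Y_S) \le \delta'$. Provided that $\delta$ (and hence $\delta'$) is small enough,

$$(\sigma_1 - \delta)\left(1 - \frac{8\delta}{\sigma_1 - \sigma_2}\right) \ge 1 - 8\sqrt{\delta}.$$

For such $S$, Lemma 5.4 tells us that the new subset $\hat S \subseteq S$ obtained by rounding the top right singular vector of $Y_S$ satisfies, for $T := \arg\max_{T \in \Gamma^*} \frac{|T \cap S|}{|T|}$:

• $\Delta(\hat S, T) \le 1 - (\sigma_1 - \delta)\left(1 - \frac{8\delta}{\sigma_1 - \sigma_2}\right) \le 8\sqrt{\delta}$.

• For any $T' \ne T \in \Gamma^*$, $|T' \cap \hat S| \le 4\sqrt{\delta}|T'|$.

□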


5.4 Correctness of BOOST (Algorithm 1) 

As we mentioned earlier, if we keep finding clusters and removing them iteratively, the error will quickly accumulate and degrade the quality of the remaining clusters. To prevent this, we apply a boosting procedure as described in $\hat S \leftarrow$ BOOST($Y$, $S$) (Algorithm 1) every time we find a new cluster. The main idea is that, if $S$ is close to some cluster in the ground truth, say $T$, and far from the others, then the top left singular vector, say $p$, of the vectors associated with $S$ will be close to the ones in $S \cap T$. Unfortunately, we cannot use simple perturbation bounds such as Wedin's theorem. We have to make full use of eq. (5) instead: under projection by $p$, the vectors of $T$ tend to stay together; therefore vectors in $T \setminus S$ will be very close to the vectors in $S \cap T$. Hence $p$ will indeed be close to most of $T \setminus S$ in addition to $S \cap T$.
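The boosting step can be illustrated on a noiseless toy instance. The sketch below follows Algorithm 1 literally (top left singular vector of $Y_S$, then threshold rounding of $Y^T p$); the function names `boost` and `round_vector`, and the example data, are assumptions made for this illustration.

```python
import numpy as np

def round_vector(q):
    """Threshold rounding: best set {u : s*q_u >= s*q_v} by |sum of q on the set| / sqrt(size)."""
    best, best_score = None, -1.0
    for s in (1.0, -1.0):
        for v in range(len(q)):
            mask = s * q >= s * q[v]
            score = abs(q[mask].sum()) / np.sqrt(mask.sum())
            if score > best_score:
                best, best_score = np.flatnonzero(mask), score
    return best

def boost(Y, S):
    """BOOST(Y, S): round Y^T p, where p is the top left singular vector of Y_S."""
    S = sorted(S)
    p = np.linalg.svd(Y[:, S])[0][:, 0]        # top left singular vector of Y_S
    return set(round_vector(Y.T @ p).tolist())

# Toy instance: a rotated basis of the span of three normalized cluster indicators.
clusters = [{0, 1, 2, 3}, {4, 5, 6}, {7, 8, 9, 10, 11}]
Y0 = np.array([[1 / np.sqrt(len(T)) if u in T else 0.0 for u in range(12)]
               for T in clusters])
Q = np.linalg.qr(np.random.default_rng(3).standard_normal((3, 3)))[0]
Y = Q @ Y0
refined = boost(Y, S={0, 1, 2, 4})             # a noisy guess for the first cluster
print(refined)
```

Here the guess misses node 3 and wrongly includes node 4, yet $Y^T p$ is (up to sign) the normalized indicator of the first cluster, so rounding recovers it exactly.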

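Theorem 5.8 (Boosting). Given $Y \in S_k(\mathbb{R}^n)$ and $\Gamma^* \in \mathrm{Disj}_V(k)$ with $\|Y(\Gamma^*)^{\perp}\|_2^2 \le \varepsilon$, consider any subset $S \subseteq V$. Suppose there exists $T \in \Gamma^*$ with $|S \cap T| \ge (1 - \alpha)|T|$ such that for any $T' \ne T \in \Gamma^*$, $|S \cap T'| \le \alpha|T'|$ for some sufficiently small constant $\alpha$ (the proof uses $\alpha \le 1/4$). Then for $\hat S \leftarrow$ BOOST($Y$, $S$) (Algorithm 1), $\hat S$ satisfies:

$$\Delta(\hat S, T) \le c_0 \sqrt{\varepsilon}$$

for some constant $c_0 \le 50$.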

Remark 5.9. Note that Theorem 5.8 allows us to convert a subset with non-negligible overlap into a subset which is very close. Unfortunately, the new subset we obtain is no longer guaranteed to have small intersection with other $T' \ne T$.
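Proof of Theorem 5.8. By Lemma 2.10, for $\delta := \sqrt{\varepsilon}$:

$$\|Y^T Y - (\Gamma^*)^{\Pi}\|_2 \le \delta.$$

We will use $\sigma_1 := \sigma_1^2(Y_S)$ and $\sigma_2 := \sigma_2^2(Y_S)$. By Claim 5.3, $\left|\sigma_1 - |S \cap T|/|T|\right| \le \delta$ and $\sigma_2 \le \alpha + \delta$. Since $p$ is the top left singular vector of $Y_S$, $\|Y_S^T p\|^2 = \sigma_1$. We define $q := Y^T p$ so that $\|q_S\|^2 = \sigma_1$:

$$\frac{\|q_S\|^2}{|S \cap T|/|T|} \ge 1 - \frac{\delta}{|S \cap T|/|T|} \ge 1 - \frac{\delta}{1 - \alpha} \ge 1 - \frac{4\delta}{3}.$$

Since $\sigma_1((\Gamma^*)^{\Pi}_{S,S}) - \sigma_2((\Gamma^*)^{\Pi}_{S,S}) \ge 1 - 2\alpha$, Proposition 5.2 implies:

$$\langle \overline{e}_{T \cap S}, q \rangle^2 \ge \|q_S\|^2 \left(1 - \frac{2\delta}{1 - 2\alpha}\right) \ge \|q_S\|^2 (1 - 4\delta). \tag{9}$$

Let's use $\mu_A$ to denote the mean of $q$ on subset $A$, $\mu_A := \frac{1}{\sqrt{|A|}} \langle \overline{e}_A, q \rangle$. We will assume, without loss of generality, $\mu_{S \cap T} \ge 0$. Hence $\sqrt{|S \cap T|}\, \mu_{S \cap T} \ge \|q_S\| \sqrt{1 - 4\delta}$. On the other hand,

$$\varepsilon \ge p^T Y (\Gamma^*)^{\perp} Y^T p = q^T (\Gamma^*)^{\perp} q \ge \frac{1}{|T|} \sum_{i < j \in T} (q_i - q_j)^2 \ge |S \cap T| (\mu_{S \cap T} - \mu_T)^2,$$

$$\mu_T \ge \mu_{S \cap T} - \sqrt{\frac{\varepsilon}{|S \cap T|}}, \qquad \langle \overline{e}_T, q \rangle \ge \sqrt{|T|}\, \mu_{S \cap T} - \sqrt{\frac{\varepsilon |T|}{|S \cap T|}}.$$

Using eq. (9), we can lower bound this quantity as:

$$\langle \overline{e}_T, q \rangle \ge \|q_S\| \sqrt{(1 - 4\delta) \frac{|T|}{|S \cap T|}} - \sqrt{\frac{\varepsilon |T|}{|S \cap T|}} \ge 1 - 4\delta - \sqrt{\tfrac{4}{3}}\, \delta \ge 1 - \tfrac{11}{2}\delta.$$

Since $\|q\| = \|Y^T p\| \le 1$, we have $\Delta(q, T) \le 1 - \langle \overline{e}_T, q \rangle^2 \le 11\delta$. Using Proposition 5.10, we see that $\hat S \leftarrow$ ROUND($q$) satisfies:

$$\Delta(\hat S, T) \le 4\Delta(q, T) \le 44\delta = 44\sqrt{\varepsilon}.$$

□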
5.5 Correctness of ROUND (Algorithm 2)

Possibly the simplest case for our problem is when Y is 1-dimensional, i.e. it is a vector. As we 
argued in the introduction, our algorithm boils down to simple thresholding in this case. 

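Proposition 5.10. Given $q \ne 0 \in \mathbb{R}^n$, for any $T \ne \emptyset$, $\hat S \leftarrow$ ROUND($q$) (Algorithm 2) satisfies $\Delta(\hat S, T) \le 4\Delta(q, T)$.

Proof. Let $\varepsilon := \Delta(q, T)$. Without loss of generality, we may assume $q_1 \ge \dots \ge q_m > 0 \ge \dots \ge q_n$, $\|q\| = 1$ and $\langle q, \overline{e}_T \rangle \ge 0$. We have:

$$\sqrt{1 - \varepsilon} = \langle q, \overline{e}_T \rangle \le \frac{1}{\sqrt{|T|}} \sum_{j \le |T|} q_j \le \max_{j \le m} \frac{1}{\sqrt{j}} \sum_{i \le j} q_i \le \max_{S \in F} |\langle q, \overline{e}_S \rangle| = |\langle q, \overline{e}_{\hat S} \rangle|.$$

So $\Delta(\hat S, q) \le \varepsilon$ and $\sqrt{\Delta(\hat S, T)} \le \sqrt{\Delta(\hat S, q)} + \sqrt{\Delta(q, T)} \le 2\sqrt{\varepsilon}$. □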
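The thresholding step and the guarantee $\Delta(\hat S, T) \le 4\Delta(q, T)$ can be tried out with a short NumPy sketch. It enumerates the candidate sets of Algorithm 2 directly, which takes quadratic time; the helper names and the distance functions below are assumptions, with $\Delta(S, T) = 1 - |S \cap T|^2 / (|S||T|)$ and $\Delta(q, T)$ taken as one minus the squared cosine between $q$ and the indicator of $T$.

```python
import numpy as np

def round_vector(q):
    """ROUND(q): best set {u : s*q_u >= s*q_v} by |sum of q on the set| / sqrt(size)."""
    best, best_score = None, -1.0
    for s in (1.0, -1.0):
        for v in range(len(q)):
            mask = s * q >= s * q[v]
            score = abs(q[mask].sum()) / np.sqrt(mask.sum())
            if score > best_score:
                best, best_score = np.flatnonzero(mask), score
    return set(best.tolist())

def delta_sets(S, T):
    """Delta(S, T) = 1 - |S n T|^2 / (|S| |T|)."""
    return 1.0 - len(S & T) ** 2 / (len(S) * len(T))

def delta_vec(q, T):
    """Delta(q, T) = 1 - <q, e_T>^2 / (|T| ||q||^2), e_T the 0/1 indicator of T."""
    idx = sorted(T)
    return 1.0 - q[idx].sum() ** 2 / (len(idx) * float(q @ q))

rng = np.random.default_rng(4)
T = {0, 3, 5, 8}
q = 0.1 * rng.standard_normal(20)
q[sorted(T)] += 1.0                     # noisy indicator of T
S_hat = round_vector(q)
print(S_hat, delta_sets(S_hat, T), 4 * delta_vec(q, T))
```

Flipping the sign of `q` leaves the output unchanged, since both threshold directions $s \in \{\pm 1\}$ are tried.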


5.6 Correctness of SpectralClustering (Algorithm 3) 

Finally, we put everything together and prove the correctness of SPECTRALCLUSTERING($Y$) (Algorithm 3). In the following lemma, we will show that the algorithm will iteratively find sets $S_1, S_2, \dots$ (think of them as coarse approximations of the $T_i$'s) and $\hat S_1, \hat S_2, \dots$ (think of them as fine approximations of the $T_i$'s) such that at each iteration, each $S_i$ and $\hat S_i$ will correspond to a unique $T_i$. Moreover, each $S_i$ will have very small overlap with the remaining $T_j$'s coming after it. Even though the $S_i$'s might still have large overlap with previous $T_j$'s for $j < i$, we can easily use UNRAVEL($\Gamma$) (Algorithm 5) to rectify this issue.

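Lemma 5.11. Let $\Gamma^* \in \mathrm{Disj}_V(k)$ with $\|Y(\Gamma^*)^{\perp}\|_2^2 \le \varepsilon$ for some $\varepsilon \le \varepsilon_0$, where $\varepsilon_0 \in (0,1)$ is a constant. For any $r \in [k]$, consider the sequences $\Gamma = (S_1, \dots, S_r)$ and $\hat\Gamma = (\hat S_1, \dots, \hat S_r)$ as found by SPECTRALCLUSTERING($Y$) at the start of the $r$-th iteration. Then there exists an ordering of $\Gamma^*$:

$$\Gamma^* = (\underbrace{T_1, T_2, \dots, T_r}_{=: \Gamma^{*\prime}}, \underbrace{T_{r+1}, \dots, T_k}_{=: \Gamma^{*\prime\prime}}),$$

with the following properties for some sufficiently small constant $\alpha$ and $\beta \le 100$:

(a) For every $i \le r$:

• $\Delta(S_i, T_i) \le \alpha$.

• For all $j > i$, $|T_j \cap S_i| \le \alpha|T_j|$.

(b) For every $i \le r$, $\Delta(\hat S_i, T_i) \le \beta\sqrt{\varepsilon}$.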

Proof. By induction on $r$. For $r = 0$, (a) and (b) are trivially true.

Given $r$, suppose (a) and (b) are true with $(T_1, \dots, T_r)$, $\Gamma^{*\prime}$ and $\Gamma^{*\prime\prime}$ being as described. At the beginning of the $(r+1)$-st iteration, we have $\Gamma = (S_1, \dots, S_r)$ and $\hat\Gamma = (\hat S_1, \dots, \hat S_r)$. (b) means $\hat\Gamma$ is $\beta\sqrt{\varepsilon}$-close to $\Gamma^{*\prime}$.

After $\Gamma' \leftarrow$ UNRAVEL($\hat\Gamma$), by Lemma 5.6, $\Gamma'$ is $4\beta\sqrt{\varepsilon}$-close to $\Gamma^{*\prime}$ and $\Gamma' \in \mathrm{Disj}_V(r)$. Using Theorem 2.8, we see that $\|\Gamma' - \Gamma^{*\prime}\|_2^2 \le 8\beta\sqrt{\varepsilon}$. Now we can invoke Lemma 5.5, which implies:

$$\|Z(\Gamma^{*\prime\prime})^{\perp}\|_2^2 \le \varepsilon + 8\beta\sqrt{\varepsilon} \le 9\beta\sqrt{\varepsilon}.$$

Provided $9\beta\sqrt{\varepsilon} \le \delta_0$ for some small enough constant $\delta_0$, we can use Lemma 5.7 to see that the subset $S_{r+1} \leftarrow$ FINDCLUSTER($Z$) satisfies, for some $T \in \Gamma^{*\prime\prime}$,

$$\Delta(S_{r+1}, T) \le \alpha,$$

and

$$\forall T' \ne T,\ T' \in \Gamma^{*\prime\prime} \implies |T' \cap S_{r+1}| \le \alpha|T'|.$$

We reorder $(T_{r+1}, \dots, T_k)$ so that $T_{r+1} = T$. Then (a) holds true when $i = r + 1$. Since $T_1, \dots, T_r$ remain the same, (a) is true for all $i \le r + 1$.

Consider $(U_1, \dots, U_{r+1}) \leftarrow$ UNRAVEL($\Gamma$). By Lemma 5.6, we know that $\Delta(U_i, T_i) \le 4\alpha$ and $U_i \subseteq S_i$, $|U_i| \ge (1 - \alpha)|S_i|$ for each $i \in [r + 1]$. In particular, for all $i \le r + 1$:

$$|U_i \cap T_i| \ge (1 - 4\alpha)|T_i|.$$

$U_{r+1}$ being a subset of $S_{r+1}$ means $|U_{r+1} \cap T_j| \le |S_{r+1} \cap T_j| \le \alpha|T_j|$ whenever $j > r + 1$. Now we will prove the case of $j \le r$. Using the fact that the $U$'s are disjoint, for any $j \le r$:

$$|U_{r+1} \cap T_j| \le |T_j| - |T_j \cap U_j| \le |T_j| - (1 - 4\alpha)|T_j| = 4\alpha|T_j|.$$

Consequently, for any $j \ne r + 1$:

$$|U_{r+1} \cap T_j| \le 4\alpha|T_j|.$$

After executing $\hat S_{r+1} \leftarrow$ BOOST($Y$, $U_{r+1}$), noting that $\alpha$ is small enough, we see via Theorem 5.8:

$$\Delta(\hat S_{r+1}, T_{r+1}) \le c_0\sqrt{\varepsilon} = \beta\sqrt{\varepsilon}.$$

Combined with the fact that $\hat S_i$ and $T_i$ remain the same for $i \le r$, (b) also remains true for all $i \le r + 1$.

By induction, we now see that both (a) and (b) are true for all $r \le k$. □

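Theorem 5.12. Let $\Gamma^* \in \mathrm{Disj}_V(k)$ with $\|Y(\Gamma^*)^{\perp}\|_2^2 \le O(\varepsilon)$. Then $\hat\Gamma \leftarrow$ SPECTRALCLUSTERING($Y$) is a $k$-partition so that $\hat\Gamma \in \mathrm{Disj}_V(k)$, and it is $O(\sqrt{\varepsilon})$ close to both $\Gamma^*$ and $Y$:

$$\Delta(\Gamma^*, \hat\Gamma) \le O(\sqrt{\varepsilon}) \quad\text{and}\quad \|Y\hat\Gamma^{\perp}\|_2^2 \le O(\sqrt{\varepsilon}).$$

Proof. By Lemma 5.11, $\hat\Gamma = (\hat S_1, \dots, \hat S_k)$ is $\beta\sqrt{\varepsilon}$-close to $\Gamma^*$. Lemma 5.6 implies that UNRAVEL($\hat\Gamma$) outputs a disjoint collection of $k$ subsets which is $4\beta\sqrt{\varepsilon}$-close to $\Gamma^*$. For the second bound:

$$\tfrac{1}{2}\|Y\hat\Gamma^{\perp}\|_2^2 \le \|Y(\Gamma^*)^{\perp}\hat\Gamma^{\perp}\|_2^2 + \|Y(\Gamma^*)^{\Pi}\hat\Gamma^{\perp}\|_2^2$$
$$\le \|Y(\Gamma^*)^{\perp}\|_2^2 \cdot \|\hat\Gamma^{\perp}\|_2^2 + \|Y\Gamma^*\|_2^2 \cdot \|(\Gamma^*)^T\hat\Gamma^{\perp}\|_2^2$$
$$\le \varepsilon + \|(\Gamma^*)^T\hat\Gamma^{\perp}\|_2^2 \le \varepsilon + 2\Delta(\Gamma^*, \hat\Gamma) \le O(\sqrt{\varepsilon}).$$

In the second to last inequality, we used Theorem 2.8. □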
6 Applications

In this section, we will show some applications of our spectral clustering algorithm. 

6.1 k-EXPANSION

Our first application is approximating non-expanding $k$-partitions in graphs. One may also interpret this as applying our subspace rounding algorithm on the basic SDP relaxation for the k-EXPANSION problem.

Theorem 6.1. Given a graph $G$ with Laplacian matrix $L$, let $\hat\Gamma$ be the $k$-partition obtained by running Algorithm 3 on the smallest $k$ eigenvectors of $L$. Then:



Remark 6.2 (Faster Algorithm). By slightly modifying our algorithm to take advantage of the underlying graph structure, one can obtain a faster randomized algorithm having the same guarantees as Theorem 6.1 with expected running time $O(k^2(n + m))$.


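Proof. From Lemma 2.13, we know that $\phi_k \le \sigma_{\max}((\Gamma^*)^T L \Gamma^*) \le 2\phi_k$. Now consider the matrix $Z = Y^T$, whose columns are the smallest $k$ eigenvectors of $L$. We have $L \succeq \lambda_{k+1} Z^{\perp}$ which means:

$$\lambda_{k+1} \cdot (\Gamma^*)^T Z^{\perp} \Gamma^* \preceq (\Gamma^*)^T L \Gamma^* \preceq 2\phi_k.$$

Thus $\sigma_k(Z^T\Gamma^*) = \sigma_{\min}(Z^T\Gamma^*) \ge \sqrt{1 - O(\varepsilon)}$ and:

$$\|Y(\Gamma^*)^{\perp}\|_2^2 = \|Z^T(\Gamma^*)^{\perp}Z\|_2 = \|I_k - Z^T\Gamma^*(\Gamma^*)^T Z\|_2 = 1 - \sigma_{\min}^2((\Gamma^*)^T Z) \le O(\varepsilon).$$

The claim follows from Theorem 5.12.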


□ 


6.2 Matrix and Graph Approximations 

Our next application is for approximating a matrix in terms of $k$-block diagonal matrices corresponding to the adjacency matrices of normalized cliques, under spectral norm.

Theorem 6.3. Given a matrix $X \in \mathbb{S}^n$, let $\varepsilon := \min_{\Gamma^* \in \mathrm{Disj}_V(k)} \|X - (\Gamma^*)^{\Pi}\|_2$. In polynomial time, we can find $\hat\Gamma \in \mathrm{Disj}_V(k)$ such that $\Delta(\Gamma^*, \hat\Gamma) \le O(\sqrt{\varepsilon})$ and:

$$\|X - \hat\Gamma^{\Pi}\|_2 \le O(\varepsilon^{1/4}).$$

Proof. Let $Y$ be the matrix whose rows are the top $k$ eigenvectors of $X$. Consider $\hat\Gamma \leftarrow$ SPECTRALCLUSTERING($Y$):





which means 


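$$\|Y^T Y - \Gamma^*(\Gamma^*)^T\|_2 \le 2\varepsilon \implies \|Y(\Gamma^*)^{\perp}\|_2^2 \le 2\varepsilon.$$

By Theorem 5.12, $\Delta(\Gamma^*, \hat\Gamma) \le O(\sqrt{\varepsilon})$ and $\|Y\hat\Gamma^{\perp}\|_2^2 \le O(\sqrt{\varepsilon})$:

$$\|\hat\Gamma\hat\Gamma^T - X\|_2 \le \varepsilon + \|\Gamma^*(\Gamma^*)^T - \hat\Gamma\hat\Gamma^T\|_2 \le O(\varepsilon^{1/4}).$$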


□ 

Our final application is for approximating a graph Laplacian via another Laplacian corresponding to the graph formed as a disjoint union of $k$ normalized cliques (expanders), again under spectral norm. Since we are working with Laplacian matrices, this means the new graph approximates cuts of the original graph also.

Corollary 6.4. Given a graph $G$, if there exists $\Gamma^* \in \mathrm{Disj}_V(k)$ such that the Laplacian of $G$ is $\varepsilon$-close (in spectral norm) to the Laplacian corresponding to the disjoint union of normalized cliques on each $T \in \Gamma^*$:

$$\|L - (\Gamma^*)^{\perp}\|_2 \le \varepsilon,$$

then we can find $\hat\Gamma \in \mathrm{Disj}_V(k)$ which is $O(\sqrt{\varepsilon})$-close to $\Gamma^*$ and $G$ in polynomial time:

$$\|L - \hat\Gamma^{\perp}\|_2 \le O(\varepsilon^{1/4}).$$

Proof. Since $\|L - (\Gamma^*)^{\perp}\|_2 = \|(I - L) - (\Gamma^*)^{\Pi}\|_2$, we can apply Theorem 6.3 on the matrix $I - L$. The rest follows easily. □

7 k-EXPANSION Implies Spectral Clustering

In this section, we will show that approximation algorithms for various graph partitioning problems 
imply similar approximation guarantees for our clustering problem. 

Theorem 7.1. Given $Y : Y^T \in S_k(\mathbb{R}^n)$, let $\Gamma^* := \arg\min_{\Gamma} \|Y\Gamma^{\perp}\|_2^2$ with $\varepsilon := \|Y(\Gamma^*)^{\perp}\|_2^2$. Then there exists a weighted, undirected, regular graph $X$, whose normalized Laplacian matrix has its $(k+1)$-st smallest eigenvalue $\lambda_{k+1} \ge 1 - O(\sqrt{\varepsilon})$, such that:

• Each $T \in \Gamma^*$ has small expansion, $\phi_X(T) \le O(\varepsilon)$,

• If $\Gamma \in \mathrm{Disj}_V(k)$ is a $k$-partition with $\max_{S \in \Gamma} \phi_X(S) \le \delta$, then $\Delta(\Gamma, \Gamma^*) \le O(\delta + \sqrt{\varepsilon})$.

Moreover such X can be constructed in polynomial time. 

Proof. Consider the following SDP. Here we chose $\varepsilon \in [0,1]$ to be the minimum value where this SDP remains feasible:

(i) $X \preceq Y^{\Pi} + \sqrt{\varepsilon}\, Y^{\perp}$,

(ii) $Y X Y^T \succeq (1 - \varepsilon) I_k$,

(iii) $X$ is doubly stochastic, diagonally dominant, PSD and has trace $k$.

It is easy to see that $(\Gamma^*)^{\Pi}$ is a feasible solution (Lemma 2.10). Moreover any feasible solution $X$ corresponds to the adjacency matrix of a graph which is undirected and has all degrees equal to 1. Now we will show that it has the other properties:

• $\lambda_{k+1} = 1 - \sigma_{k+1}(X) \ge 1 - \sigma_{k+1}(Y^{\Pi} + \sqrt{\varepsilon}\, Y^{\perp}) = 1 - \sqrt{\varepsilon}$.

• $\max_T \phi_X(T) \le 2\|(\Gamma^*)^T (I - X)\Gamma^*\|_2 \le O(\|Y(I - X)Y^T\|_2) + \varepsilon \le O(\varepsilon)$.

• Recall $\max_S \phi_X(S) \ge \frac{1}{2}\|\Gamma^T (I - X)\Gamma\|_2$ so $\sigma_k(Y\Gamma) \ge 1 - O(\delta + \varepsilon)$. In particular $\sigma_k((\Gamma^*)^T\Gamma) \ge 1 - O(\sqrt{\varepsilon} + \delta)$. Using Theorem 2.8, we have $\Delta(\Gamma, \Gamma^*) \le O(\sqrt{\varepsilon} + \delta)$.


□ 


8 Omitted Proofs 

8.1 Proof of Theorem 2.8 

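Theorem 8.1 (Restatement of Theorem 2.8). Given $\Gamma, \hat\Gamma \in \mathrm{Disj}_V(k)$; $\Delta(\Gamma, \hat\Gamma) \le \|\Gamma^{\perp}\hat\Gamma\|_2^2 \le 2\Delta(\Gamma, \hat\Gamma)$. Moreover, after appropriately ordering the columns of $\hat\Gamma$, $\|\Gamma - \hat\Gamma\|_2^2 \le 4\Delta(\Gamma, \hat\Gamma)$.

Upper Bound. Let $\varepsilon := \Delta(\Gamma, \hat\Gamma)$. Recall that $\hat\Gamma^T\Gamma^{\perp}\hat\Gamma$ is a Laplacian matrix. From Lemma 2.13, we see that $\|\hat\Gamma^T\Gamma^{\perp}\hat\Gamma\|_2$ is within factor 2 of the maximum diagonal element. Given $T \in \hat\Gamma$ let $S$ be the matching set in $\Gamma$ so that $\Delta(S, T) \le \Delta(\Gamma, \hat\Gamma)$. The diagonal element of $\hat\Gamma^T\Gamma^{\perp}\hat\Gamma$ corresponding to $T$ is:

$$\overline{e}_T^T \Gamma^{\perp} \overline{e}_T \le \|\overline{e}_S^{\perp}\overline{e}_T\|^2 = 1 - \frac{|S \cap T|^2}{|S||T|} = \Delta(S, T) \le \varepsilon.$$

Therefore $\|\Gamma^{\perp}\hat\Gamma\|_2^2 \le 2\varepsilon$. Before proving the lower bound, we will show how this bounds $\|\Gamma - \hat\Gamma\|_2$:

$$\tfrac{1}{2}(\Gamma - \hat\Gamma)^T(\Gamma - \hat\Gamma) = I_k - \tfrac{1}{2}(\Gamma^T\hat\Gamma + \hat\Gamma^T\Gamma).$$

So $\tfrac{1}{2}\|\Gamma - \hat\Gamma\|_2^2 \le 1 - \sigma_{\min}(\hat\Gamma^T\Gamma) = 1 - \sqrt{1 - \|\Gamma^{\perp}\hat\Gamma\|_2^2} \le 2\varepsilon$. □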

Lower Bound. Let $\varepsilon := \|\Gamma^{\perp}\hat\Gamma\|_2^2$. We define $\pi_1 : \Gamma \to \hat\Gamma$ and $\pi_2 : \hat\Gamma \to \Gamma$ as the following:

$$\forall S \in \Gamma : \ \pi_1(S) := \arg\max_{T \in \hat\Gamma} \frac{|S \cap T|}{|T|},$$

$$\forall T \in \hat\Gamma : \ \pi_2(T) := \arg\max_{S \in \Gamma} \frac{|S \cap T|}{|S|}.$$

Consider $M := \{(S, \pi_1(S)) \mid S \in \Gamma\}$: By Claims 8.3 and 8.4, $M$ is indeed a perfect matching between $\Gamma$ and $\hat\Gamma$. Now consider any matched pair $(S, T) \in M$. Without loss of generality, say $|S| \ge |T|$. By Claim 8.2, $|S \cap T| \ge (1 - \varepsilon)|S|$. Since $|S \triangle T| = |S| + |T| - 2|S \cap T|$:

$$|S \triangle T| \le |S| + |T| - 2(1 - \varepsilon)|S| = 2\varepsilon|S| + (|T| - |S|) \le 2\varepsilon|S|.$$

We finish our proof with Claims 8.2, 8.3 and 8.4.
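Claim 8.2. If $\pi_1(S) = T$, then $|S \cap T| \ge (1 - \varepsilon)|T|$. Similarly, if $\pi_2(T) = S$, then $|S \cap T| \ge (1 - \varepsilon)|S|$.

Proof. Consider the matrix $P := \Gamma^T\hat\Gamma\hat\Gamma^T\Gamma \in \mathbb{S}_+$ so that $\lambda_{\min}(P) = \sigma_{\min}^2(\hat\Gamma^T\Gamma)$. Thus $\lambda_{\min}(P) = \sigma_{\min}(\hat\Gamma^T\Gamma)^2 \ge 1 - \varepsilon$. In particular, all diagonals of $P$ are at least $1 - \varepsilon$. Consider any diagonal corresponding to $S \in \Gamma$:

$$\overline{e}_S^T \hat\Gamma\hat\Gamma^T \overline{e}_S = \sum_{T \in \hat\Gamma} \frac{|S \cap T|^2}{|S||T|} \le \left( \max_{T' \in \hat\Gamma} \frac{|S \cap T'|}{|T'|} \right) \sum_{T \in \hat\Gamma} \frac{|S \cap T|}{|S|} = \max_{T' \in \hat\Gamma} \frac{|S \cap T'|}{|T'|},$$

which, by construction, is equal to $\frac{|S \cap \pi_1(S)|}{|\pi_1(S)|}$. This proves the first part of the claim. The second part follows immediately by applying the same argument on $\hat\Gamma$ and $\Gamma$. □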


Claim 8.3. Both $\pi_1$ and $\pi_2$ are bijections.

Proof. Suppose $\pi_1(S) = \pi_1(S') = T$ for some $S \ne S'$. Since $S, S'$ are disjoint and $\varepsilon < \frac{1}{2}$:

$$|T| \ge |S \cap T| + |S' \cap T| \ge 2(1 - \varepsilon)|T| > |T|,$$

a contradiction. A similar argument shows that $\pi_2$ is a bijection as well. □

Claim 8.4. $\pi_1 = \pi_2^{-1}$.


Proof. Suppose not. Since both $\pi_1$ and $\pi_2$ are bijections by Claim 8.3, there exists a cycle of the form

$$(S_0, T_0, \dots, S_{m-1}, T_{m-1}, S_m = S_0)$$

where $\pi_1(S_i) = T_i$ and $\pi_2(T_i) = S_{i+1}$ for some $m \ge 2$. By construction, $|S_i \cap T_i| \ge (1 - \varepsilon)|T_i|$ which means $\varepsilon|T_i| \ge |T_i \setminus S_i|$. Since $S_i$ and $S_{i+1}$ are disjoint, $|T_i \setminus S_i| \ge |T_i \cap S_{i+1}|$. Again, by construction, $|T_i \cap S_{i+1}| \ge (1 - \varepsilon)|S_{i+1}|$. Therefore $\varepsilon|T_i| \ge (1 - \varepsilon)|S_{i+1}|$ which implies $|T_i| \ge \frac{1 - \varepsilon}{\varepsilon}|S_{i+1}| > |S_{i+1}|$ since $\varepsilon < 1/2$. By a similar argument, we can also show that $|S_i| > |T_i|$. Consequently, $|S_0| > |S_1| > \dots > |S_m| = |S_0|$ which is a contradiction. So all cycles have length 2, which implies $\pi_1 = \pi_2^{-1}$. □

□ 